# Exercises on Spark Streaming

The objective in this set of exercises is to get comfortable with streaming-style computations using Spark Streaming.

## First, a gentle reminder of Spark Streaming


![alt text](https://spark.apache.org/docs/latest/img/streaming-arch.png "")

Spark Streaming is a Spark library that allows streaming computations. 

In essence, Spark Streaming receives input data streams from a variety of sources (e.g., Kafka, Flume, HDFS, Kinesis, Twitter, ...) and divides them into *mini-batches*. These mini-batches are then processed by Spark to produce a final stream of results, also in batches. 

![alt text](https://spark.apache.org/docs/latest/img/streaming-flow.png "")
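
To make the idea of mini-batches concrete, here is a minimal plain-Python sketch (it does not use Spark). The function name `to_mini_batches` and the arrival times are made up for illustration; the sketch only assumes that records are grouped by the batch interval in which they arrive and that an interval without data still yields an (empty) batch.

```python
from collections import defaultdict

def to_mini_batches(events, batch_interval):
    """Group (timestamp_seconds, record) pairs into consecutive mini-batches.

    Batch i holds the records with
    i * batch_interval <= timestamp < (i + 1) * batch_interval.
    Intervals without records produce an empty batch.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for ts, record in events:
        buckets[int(ts // batch_interval)].append(record)
    last = max(buckets)
    return [buckets.get(i, []) for i in range(last + 1)]

# Hypothetical arrival times (in seconds) for a few records.
events = [(0.5, "a"), (1.2, "b"), (5.1, "c"), (16.0, "d")]
batches = to_mini_batches(events, batch_interval=5)
print(batches)  # [['a', 'b'], ['c'], [], ['d']]
```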

A continuous sequence of mini-batches is called a DStream. Each mini-batch in a DStream is represented as an RDD, and Spark Streaming provides a high-level API for manipulating DStreams. 

![alt text](https://spark.apache.org/docs/latest/img/streaming-dstream.png) 

Any operation applied to a DStream translates to operations on the underlying RDDs. Consider a simple example in which the stream of data is a stream of lines of text (e.g., simple sentences). A `flatMap` operation is applied to each RDD in the lines DStream, producing the words DStream, in which each RDD contains the words of the sentences in the corresponding mini-batch.

![alt text](https://spark.apache.org/docs/latest/img/streaming-dstream-ops.png) 
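
The lines-to-words example can be mimicked in plain Python by modelling a DStream as a list of mini-batches and applying the same function to every mini-batch separately. This is only a sketch of the semantics: `flat_map` is a hypothetical helper, not the Spark API, and the sentences are made up.

```python
def flat_map(func, batch):
    """Apply func to every element of one batch and flatten the results."""
    return [item for element in batch for item in func(element)]

# A "stream" of two mini-batches, each holding lines of text (made-up data).
lines_stream = [
    ["to be or not", "to be"],
    ["that is the question"],
]

# The same function is applied to every mini-batch independently.
words_stream = [flat_map(lambda line: line.split(" "), batch)
                for batch in lines_stream]
print(words_stream)
# [['to', 'be', 'or', 'not', 'to', 'be'], ['that', 'is', 'the', 'question']]
```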


The Spark Streaming programming guide is available at https://spark.apache.org/docs/latest/streaming-programming-guide.html

# How to use this notebook

Because of the combination of the following elements, how we execute spark streaming jobs is very different from how we experimented with Spark itself in the last exercise session.

1. Streaming computations are computations that never finish (they continuously wait for new data to arrive).
2. We will need to run multiple computations in parallel (1 computation to generate data, 1 to consume data).
3. Jupyter does not allow cells to be executed in parallel.

**Therefore**, while we will run the streaming computations inside of this notebook, you will need to launch a new notebook to launch, in parallel,  the process that generates the data. (The exact instructions may be found below.)

### How to launch spark applications

In the last lab session, we created the SparkContext (necessary to run Spark transformations and actions) *inside* the Jupyter notebook itself (reusing the context across many cells). Because we will often need to stop the streaming computation (which closes the context), in this session we will put all the computation inside a single Python script and launch this script from the command line. There are different ways to do this. 
- If you invoke it by `python <name-of-your-script.py>` the script will be run in local mode (i.e. only on the machine on which you are invoking `<name-of-your-script.py>`). This will work only if pyspark is correctly added to your PYTHONPATH variable.
- Alternatively, you can invoke it by `spark-submit <name-of-your-script.py>`. In this case the script will also be run in local mode by default. This assumes that the spark-submit command (found in the `bin` subfolder of your spark distribution) is in your path. **This is the preferred way, used below.**
- You can also pass arguments to spark-submit. For example, `spark-submit --master yarn <name-of-your-script.py>` deploys the script to YARN, which will schedule it on a cluster. This will not be shown here.

## 1. A simple Spark streaming example: counting inside a mini-batch

As already mentioned above, Spark Streaming can receive its input from different types of services, including Kafka, Twitter, Kinesis, and others. The simplest kind of streaming source, however, is the *file system*. In particular, when you set up Spark Streaming to receive data from a specific folder (which can be on your local filesystem, but could also be on HDFS), it will watch this folder for new files to appear. The new files that appear during one batch interval together form one mini-batch in the DStream. It is important to note that files that already existed in the watched folder when Spark Streaming starts will **not be processed**; only new files will be processed!
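
The "only new files" behaviour can be illustrated with a small plain-Python sketch. The class `NewFileWatcher` is hypothetical and only compares file names between polls; Spark's own file source works differently in its details, so treat this as a model of the observable behaviour, not of the implementation.

```python
import os
import tempfile

class NewFileWatcher:
    """Reports files that appear in a folder after the watcher was created."""

    def __init__(self, folder):
        self.folder = folder
        # Files present at start-up are remembered and never reported.
        self.seen = set(os.listdir(folder))

    def poll(self):
        """Return the sorted names of files added since the previous poll."""
        current = set(os.listdir(self.folder))
        new_files = sorted(current - self.seen)
        self.seen |= current
        return new_files

with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "old.txt"), "w").close()   # exists before start
    watcher = NewFileWatcher(folder)
    first_poll = watcher.poll()                           # nothing new yet
    open(os.path.join(folder, "new.txt"), "w").close()   # arrives later
    second_poll = watcher.poll()

print(first_poll, second_poll)  # [] ['new.txt']
```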

In this exercise session, we will first use the file system as a source of streaming data. The last section of this notebook has an example of connecting to Kafka.

We next describe the data that we will be using. The `data` subfolder contains a file `data/orders.txt` that contains some historical data of buy and sell orders on a stock exchange. 

In [1]:
# shows the first 5 lines of orders.txt
import headtail
headtail.head('data/orders.txt', 5)

['2016-03-22 20:25:28,1,80,EPE,710,51.00,B\n',
 '2016-03-22 20:25:28,2,70,NFLX,158,8.00,B\n',
 '2016-03-22 20:25:28,3,53,VALE,284,5.00,B\n',
 '2016-03-22 20:25:28,4,14,SRPT,183,34.00,B\n',
 '2016-03-22 20:25:28,5,62,BP,241,36.00,S\n']

Each line has the following fields.
* Order timestamp—Format yyyy-mm-dd hh:MM:ss
* Order ID —Serially incrementing integer
* Client ID —Integer randomly picked from the range 1 to 100
* Stock symbol—Randomly picked from a list of 80 stock symbols
* Number of stocks to be bought or sold—Random number from 1 to 1,000
* Price at which to buy or sell—Random number from 1 to 100
* Character B or S —Whether the event is an order to buy or sell

The contents of `data/orders.txt` are split into multiple files in the subfolder `data/split`. For example, `data/split/ordersaa.ordtmp` contains the first 1000 lines of `data/orders.txt`; `ordersab.ordtmp` contains the next 1000, and so on.

In [3]:
import headtail
headtail.head('data/split/ordersaa.ordtmp', 5)

['2016-03-22 20:25:28,1,80,EPE,710,51.00,B\n',
 '2016-03-22 20:25:28,2,70,NFLX,158,8.00,B\n',
 '2016-03-22 20:25:28,3,53,VALE,284,5.00,B\n',
 '2016-03-22 20:25:28,4,14,SRPT,183,34.00,B\n',
 '2016-03-22 20:25:28,5,62,BP,241,36.00,S\n']

In [4]:
import headtail
headtail.head('data/split/ordersab.ordtmp', 5)

['2016-03-22 20:25:28,1001,73,AAPL,798,42.00,S\n',
 '2016-03-22 20:25:28,1002,99,NQ,303,50.00,S\n',
 '2016-03-22 20:25:28,1003,29,PBR,988,67.00,B\n',
 '2016-03-22 20:25:28,1004,40,Z,327,27.00,B\n',
 '2016-03-22 20:25:28,1005,96,PBR,587,46.00,B\n']

The Python script `scripts/simulateStreamingInput.py` can be used to simulate new data arriving in a streaming fashion. Concretely, it copies the files from `data/split` to the folder `stream-IN` one by one, with a delay of 3 seconds between two files. If we start Spark Streaming to monitor the `stream-IN` folder for new files, then the net effect is that every 3 seconds, 1000 lines of stock trade data are made available to Spark Streaming.
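
For readers who want to see what such a simulator could look like, here is a minimal sketch. It is **not** the contents of `scripts/simulateStreamingInput.py`; the function name `simulate_stream`, its parameters, and the staging-folder approach are assumptions made for illustration. The sketch copies each file to a staging folder first and then renames it into the target folder, so that a watcher never observes a half-written file.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

def simulate_stream(src_dir, dst_dir, delay_seconds=3.0, pattern="*.ordtmp"):
    """Make the files of src_dir appear in dst_dir one at a time.

    Returns the file names in the order in which they were delivered.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    delivered = []
    # Staging folder next to dst_dir so the rename stays on one filesystem.
    with tempfile.TemporaryDirectory(dir=str(dst.parent)) as staging:
        for src in sorted(Path(src_dir).glob(pattern)):
            staged = Path(staging) / src.name
            shutil.copy(src, staged)             # slow part: copy the data
            os.replace(staged, dst / src.name)   # fast part: rename into place
            delivered.append(src.name)
            time.sleep(delay_seconds)
    return delivered

# Small self-check with throw-away folders and no delay.
with tempfile.TemporaryDirectory() as root:
    src = Path(root) / "split"
    src.mkdir()
    for name in ["ordersaa.ordtmp", "ordersab.ordtmp"]:
        (src / name).write_text("2016-03-22 20:25:28,1,80,EPE,710,51.00,B\n")
    delivered = simulate_stream(src, Path(root) / "stream-IN", delay_seconds=0)
    arrived = sorted(p.name for p in (Path(root) / "stream-IN").iterdir())

print(delivered)  # ['ordersaa.ordtmp', 'ordersab.ordtmp']
```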

**Example 1.1.** The file `1-countPerBatch.py` creates a Spark Streaming job that monitors the folder `stream-IN` for new files. For each mini-batch (which contains the contents of these new files), it parses each text line into a Python dictionary. Next, it computes the total number of lines that contain a *Buy* order (last column = 'B') and the total number of lines that contain a *Sell* order (last column = 'S'). Finally, the first 10 elements of each resulting RDD (here, the two counts) are printed on the console.

In [8]:
%%file 1-countPerBatch.py
import sys
import os
from datetime import datetime
from pathlib import Path

STREAM_IN = 'stream-IN'
STREAM_OUT = 'stream-OUT'

# We first delete all files from the STREAM_IN folder
# before starting spark streaming.
# This way, all files produced by scripts/simulateStreamingInput are new
print("Deleting existing files in %s ..." % STREAM_IN)
p = Path('.') / STREAM_IN
for f in p.glob("*.ordtmp"):
    os.remove(f)
print("... done")

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

sc = SparkContext()      #Default spark context, arguments (e.g, cluster, memory) can be passed through spark-submit
sc.setLogLevel("WARN")   #Make sure warnings and errors observed by spark are printed, but not INFO messages.

# Uncomment the following if you want to see exactly with which arguments the spark context has been created.
#print("----------------- SC CONF------------------")
#print(sc._conf.getAll())
#print("------------------------------------------")

ssc = StreamingContext(sc, 5)  #generate a mini-batch every 5 seconds
filestream = ssc.textFileStream(STREAM_IN) #monitor new files in folder stream-IN

def parseOrder(line):
  '''parses a single line in the orders file into a dictionary'''
  s = line.split(",")
  try:
      if s[6] != "B" and s[6] != "S":
        raise Exception('Wrong format')
      return [{"time": datetime.strptime(s[0], "%Y-%m-%d %H:%M:%S"),
               "orderId": int(s[1]), 
               "clientId": int(s[2]),
               "symbol": s[3], 
               "amount": int(s[4]), 
               "price":  float(s[5]), 
               "buy": s[6] == "B"}]
  except Exception as err:
      print("Wrong line format (%s): %s" % (line,err))
      return [] #ignore this line since it threw an error while parsing

# Convert the input DStream (where each RDD contains lines) into a
# DStream of python dictionaries (where each RDD contains dictionaries) 
# flatMap applies parseOrder on each line in each RDD in
# the DStream, where results are flattened
orders = filestream.flatMap(parseOrder)

from operator import add

# Calculate total number of buy/sell orders (buy -> key = True, sell -> key = False)
# map applies its argument function to each element of each RDD in the DStream
# reduceByKey applies reduceByKey on each RDD in the DStream
numPerType = orders.map(lambda o: (o['buy'], 1)).reduceByKey(add)

# Print the first 10 elements of each RDD computed in the DStream to stdout
# This is useful for debugging purposes only
numPerType.pprint()

# -----ALTERNATIVE TO PPRINT----
# If instead you want to save each computed RDD to a file, uncomment the following
# This creates a new folder for each RDD computed; inside the folder 1 file for
#  each partition in the rdd is created. To make this easy to inspect, we
# repartition the RDD into 1 single partition (but this is not required).
# ------------------------------
# numPerType.repartition(1).saveAsTextFiles(STREAM_OUT, "txt")

# Now start consuming input and wait forever (or until you press CTRL+C)
# When run from inside jupyter, click on menu Kernel -> Interrupt to press CTRL+C
ssc.start()
ssc.awaitTermination()

Overwriting 1-countPerBatch.py
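
To see what the `map` followed by `reduceByKey(add)` computes for a single mini-batch, the following plain-Python sketch performs the same two steps on the five sample lines of `data/orders.txt` shown earlier, treated here as one made-up mini-batch. It does not use Spark; a `Counter` stands in for the per-key reduction.

```python
from collections import Counter

# The five sample lines shown earlier, used as one made-up mini-batch.
batch = [
    "2016-03-22 20:25:28,1,80,EPE,710,51.00,B",
    "2016-03-22 20:25:28,2,70,NFLX,158,8.00,B",
    "2016-03-22 20:25:28,3,53,VALE,284,5.00,B",
    "2016-03-22 20:25:28,4,14,SRPT,183,34.00,B",
    "2016-03-22 20:25:28,5,62,BP,241,36.00,S",
]

# map step: one (key, 1) pair per order; key is True for buy, False for sell.
pairs = [(line.split(",")[6] == "B", 1) for line in batch]

# reduceByKey(add) step: sum the values of all pairs that share a key.
counts = Counter()
for key, value in pairs:
    counts[key] += value

print(sorted(counts.items()))  # [(False, 1), (True, 4)]
```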


**Exercise 1.2** Do the following.

1. Inspect the contents of the file `1-countPerBatch.py`. See if you understand what is being done.
2. Execute this python script in **local** mode by means of `spark-submit`. In parallel (i.e., in a separate notebook/shell/command line), execute `python scripts/simulateStreamingInput.py` to start copying data to the `stream-IN` folder.
3. You can terminate the Spark Streaming and `simulateStreamingInput` jobs by interrupting your notebook (which is the same as pressing CTRL+C on a command line).
4. Modify `1-countPerBatch.py` and uncomment the line that saves every RDD in the DStream to `stream-OUT`. Re-execute to see what happens.

**Note:** We run in local mode because we are reading from the local filesystem. If we wanted to execute it in a distributed fashion, we would need other scripts for generating the data (which create new files in HDFS instead of on the local filesystem).

In [9]:
# Run the 1-countPerBatch in local mode (i.e., on the current machine)
# (Note: Assumes that spark-submit executable is in your path.)
# Execute 'scripts/simulateStreamingInput.py' in a separate notebook to generate
# the input to this spark streaming file.
!spark-submit --master "local[*]" 1-countPerBatch.py

20/11/18 15:19:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleting existing files in stream-IN ...
... done
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/11/18 15:19:29 INFO SparkContext: Running Spark version 2.4.5
20/11/18 15:19:29 INFO SparkContext: Submitted application: 1-countPerBatch.py
20/11/18 15:19:29 INFO SecurityManager: Changing view acls to: bigdata
20/11/18 15:19:29 INFO SecurityManager: Changing modify acls to: bigdata
20/11/18 15:19:29 INFO SecurityManager: Changing view acls groups to: 
20/11/18 15:19:29 INFO SecurityManager: Changing modify acls groups to: 
20/11/18 15:19:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(bigdata); groups with view permissions: Set(); users  with modify permissions: Set(bigdata); groups with modify permissions: Set()
20/11/18 15:19:30 INFO Utils: 

---
**Some comments**:

* Just like you need a SparkContext object to construct RDDs, you need a StreamingContext to construct DStreams. A StreamingContext is created from an existing SparkContext.
* Only 1 StreamingContext can be executing per JVM, i.e., 1 per Spark Streaming job.
* You can stop a StreamingContext `ssc` by calling `ssc.stop()`. This, however, will also close the SparkContext that was used to create it. Call `ssc.stop(False)` to avoid closing the SparkContext (which can then be used to construct a new StreamingContext).
---


**Exercise 1.3** Copy `1-countPerBatch.py` into a file `1.3-countAndVolumePerBatch.py` and modify the latter to output, for each mini-batch, the following pairs:

```
('BUY', total number of buy orders in this minibatch RDD)
('SELL', total number of sell orders in this minibatch RDD)
('BUYVOL', total volume bought in this minibatch RDD)
('SELLVOL', total volume sold in this minibatch RDD)
```
Here, the *volume* of an order is the order's amount times the order's price.

(Hint: create two dstreams, one for the counts and one for the volumes, and union them  with the `union` method of dstreams).

Be sure to test your implementation.

In [None]:
%%file 1.3-countAndVolumePerBatch.py

# Put your solution here!

In [None]:
!spark-submit --master "local[*]" 1.3-countAndVolumePerBatch.py

**Exercise 1.4** Copy `1-countPerBatch.py` into a file `1.4-countAndVolumePerBatch.py` and modify the latter to output, for each mini-batch, the following pairs:
```
('BUY': total number of buy orders in this minibatch)
('SELL': total number of sell orders in this minibatch)
('<userid>': total volume traded (bought or sold) by this user-id in this mini-batch)
```
Where the last pair is repeated for every `<userid>` present in the current minibatch.

In [None]:
%%file 1.4-countAndVolumePerBatch.py

# Put your solution here!



In [None]:
!spark-submit --master "local[*]" 1.4-countAndVolumePerBatch.py

## 2. Aggregating data across mini-batches

Often we need to compute aggregates of data that spans multiple mini-batches. The file `2-totalVolumePerClient.py` will output, for each mini-batch, the following pairs:

```
('BUY': total number of buy orders in this minibatch RDD)
('SELL': total number of sell orders in this minibatch RDD)
('<userid>': total volume traded by this user-id across all mini-batches, present and past)
```
Where the last pair is repeated for every `<userid>` ever encountered and the total volume includes both buys and sells.

It works by using the `updateStateByKey` function of pair DStreams, which lets you maintain a state (per key) across mini-batches. Concretely, `updateStateByKey` takes as argument a function that gets two inputs: the list of new values for the key (in this mini-batch) and the old state (which is `None` if the key hasn't been seen before). The function must return the new state to be maintained. This new state also becomes part of the output RDD.

In our example, the state is just the total volume bought and sold so far, i.e., a single number (a float, since prices are parsed as floats).
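
The contract of the update function can be tried out without Spark. In the sketch below, `update_total` has the same shape as the function passed to `updateStateByKey` (new values plus old state in, new state out), while `run_stateful` is a hypothetical plain-Python driver that plays the role of Spark. It assumes that the update function is also invoked for keys that have an old state but no new values in the current mini-batch.

```python
def update_total(new_values, old_total):
    """Update function: add this batch's values to the running total."""
    return sum(new_values) + (old_total if old_total is not None else 0)

def run_stateful(batches, update):
    """Feed mini-batches of (key, value) pairs through a per-key state."""
    state = {}
    outputs = []
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # The update function runs for keys with new values and for keys
        # that only have an old state (their list of new values is empty).
        for key in set(grouped) | set(state):
            state[key] = update(grouped.get(key, []), state.get(key))
        outputs.append(dict(state))
    return outputs

# Two made-up mini-batches of (clientId, volume) pairs.
batches = [
    [(80, 100.0), (70, 50.0), (80, 25.0)],
    [(70, 10.0)],
]
snapshots = run_stateful(batches, update_total)
for snapshot in snapshots:
    print(sorted(snapshot.items()))
# [(70, 50.0), (80, 125.0)]
# [(70, 60.0), (80, 125.0)]
```

Note that client 80 keeps its total in the second mini-batch even though it placed no new orders there.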

In [17]:
%%file 2-totalVolumePerClient.py
import sys
import os
from datetime import datetime
from pathlib import Path

STREAM_IN = 'stream-IN'
STREAM_OUT = 'stream-OUT'

# We first delete all files from the STREAM_IN folder
# before starting spark streaming.
# This way, all files are new
print("Deleting existing files in %s ..." % STREAM_IN)
p = Path('.') / STREAM_IN
for f in p.glob("*.ordtmp"):
    os.remove(f)
print("... done")

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

sc = SparkContext()      #Default spark context, arguments (e.g, cluster, memory) can be passed through spark-submit
sc.setLogLevel("WARN")   #Make sure warnings and errors observed by spark are printed.

ssc = StreamingContext(sc, 5)  #generate a mini-batch every 5 seconds
filestream = ssc.textFileStream(STREAM_IN) #monitor new files in folder stream-IN

def parseOrder(line):
  '''parses a single line in the orders file'''
  s = line.split(",")
  try:
      if s[6] != "B" and s[6] != "S":
        raise Exception('Wrong format')
      return [{"time": datetime.strptime(s[0], "%Y-%m-%d %H:%M:%S"),
               "orderId": int(s[1]), 
               "clientId": int(s[2]),
               "symbol": s[3], 
               "amount": int(s[4]), 
               "price":  float(s[5]), 
               "buy": s[6] == "B"}]
  except Exception as err:
      print("Wrong line format (%s): %s" % (line,err))
      return []

orders = filestream.flatMap(parseOrder)

from operator import add

# Calculate total number of buy/sell orders (buy -> key = True, sell -> key = False)
numPerType = orders.map(lambda o: ("BUY", 1) if o['buy'] else ("SELL", 1)).reduceByKey(add)

volumePerClient = orders.map(lambda o: (o['clientId'], o['amount'] * o['price']))
volumeState = volumePerClient.updateStateByKey(lambda vals, totalOpt: sum(vals) + totalOpt if totalOpt is not None else sum(vals))

finalStream = numPerType.union(volumeState)
finalStream.pprint(50)

#finalStream.repartition(1).saveAsTextFiles(STREAM_OUT, "txt")

# updateStateByKey requires checkpointing; set the spark checkpoint
# folder to the subfolder of the current folder named "checkpoint"
sc.setCheckpointDir("checkpoint")

ssc.start()
ssc.awaitTermination()

Overwriting 2-totalVolumePerClient.py


**Exercise 2.1**
1. Inspect the contents of the file `2-totalVolumePerClient.py`. See if you understand what is being done.
2. Execute this python script. In parallel (i.e., in a separate shell/command line), execute `scripts/simulateStreamingInput.py` to start copying data to the `stream-IN` folder.
3. You can terminate the Spark Streaming and `simulateStreamingInput` jobs by pressing control+C

In [None]:
!spark-submit --master "local[*]" 2-totalVolumePerClient.py

**Exercise 2.2** Copy `2-totalVolumePerClient.py` into a file `2.2-top5Clients.py` and 
modify the latter to output, for each mini-batch, the user ids of the top 5 clients (i.e., the 5 clients that have the largest buy/sell volume over all orders seen so far).

*Hint*: to calculate the top-5 elements of an RDD you can first sort the RDD (using `sortBy`) and then take the first 5 elements (first `zipWithIndex` to associate the index to each element, then filter only those elements whose index is less than 5). Note, however, that a DStream is a sequence of RDDs, not a single RDD. So, you need to do this transformation on each RDD in the DStream, which you can do by means of the DStream's `transform()` method (which takes as argument a function that transforms the RDD).
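
The sort, index, filter recipe from the hint can be rehearsed on an ordinary Python list before translating it to RDD operations. The sketch below is a plain-Python analogue, not the exercise solution: `top_n` is a hypothetical helper, `sorted` stands in for `sortBy`, `enumerate` for `zipWithIndex`, and the data is made up.

```python
def top_n(pairs, n=5):
    """Keep the n (key, total) pairs with the largest totals."""
    # Sort step: largest total first.
    ranked = sorted(pairs, key=lambda kv: kv[1], reverse=True)
    # Index step: pair every element with its position, as (element, index).
    indexed = [(pair, index) for index, pair in enumerate(ranked)]
    # Filter step: keep only the elements whose index is below n.
    return [pair for pair, index in indexed if index < n]

# Made-up (clientId, total volume) pairs standing in for one RDD.
totals = [(1, 40.0), (2, 90.0), (3, 10.0), (4, 75.0),
          (5, 60.0), (6, 20.0), (7, 85.0)]
top5 = top_n(totals)
print(top5)
# [(2, 90.0), (7, 85.0), (4, 75.0), (5, 60.0), (1, 40.0)]
```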


In [None]:
%%file 2.2-top5Clients.py

# Your solution here!

In [None]:
!spark-submit --master "local[*]" 2.2-top5Clients.py

## 3. Time-limited aggregates using windows

Using windowing operations, we can compute time-limited aggregates. 

**Example 3.1** An example is given in `3-salesPerMinutes.py`, which computes the total number of orders seen in the last minute, with a refresh of this total every 15 seconds.

In [19]:
%%file 3-salesPerMinutes.py
import sys
import os
from datetime import datetime
from pathlib import Path

STREAM_IN = 'stream-IN'
STREAM_OUT = 'stream-OUT'

# We first delete all files from the STREAM_IN folder
# before starting spark streaming.
# This way, all files are new
print("Deleting existing files in %s ..." % STREAM_IN)
p = Path('.') / STREAM_IN
for f in p.glob("*.ordtmp"):
  os.remove(f)
print("... done")

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

sc = SparkContext()      #Default spark context, arguments (e.g, cluster, memory) can be passed through spark-submit
sc.setLogLevel("WARN")   #Make sure warnings and errors observed by spark are printed.

ssc = StreamingContext(sc, 5)  #generate a mini-batch every 5 seconds
filestream = ssc.textFileStream(STREAM_IN) #monitor new files in folder stream-IN

def parseOrder(line):
  '''parses a single line in the orders file'''
  s = line.split(",")
  try:
      if s[6] != "B" and s[6] != "S":
        raise Exception('Wrong format')
      return [{"time": datetime.strptime(s[0], "%Y-%m-%d %H:%M:%S"),
               "orderId": int(s[1]), 
               "clientId": int(s[2]),
               "symbol": s[3], 
               "amount": int(s[4]), 
               "price":  float(s[5]), 
               "buy": s[6] == "B"}]
  except Exception as err:
      print("Wrong line format (%s): %s" % (line,err))
      return []

from operator import add
orders = filestream.flatMap(parseOrder)
ordersPerMinute = orders.map(lambda o: 1).window(60, 15) # window length = 60 sec, slide = 15 sec
orderCountPerMinute = ordersPerMinute.reduce(add)
orderCountPerMinute.pprint()

# window operations require checkpointing; set the spark checkpoint
# folder to the subfolder of the current folder named "checkpoint"
sc.setCheckpointDir("checkpoint")

ssc.start()
ssc.awaitTermination()



Writing 3-salesPerMinutes.py
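
The effect of a 60-second window that slides every 15 seconds over 5-second mini-batches can be previewed with a plain-Python sketch. The function `windowed_counts` is hypothetical and the load (a constant 1000 records per mini-batch for 90 seconds, then nothing) is made up; the sketch assumes that each result sums the mini-batches of the preceding window length.

```python
def windowed_counts(batch_sizes, batch_interval=5, window=60, slide=15):
    """Count the records falling in each sliding window.

    batch_sizes[i] is the number of records in the i-th mini-batch.
    A result is produced every `slide` seconds and covers the mini-batches
    of the preceding `window` seconds.
    """
    per_window = window // batch_interval   # mini-batches per window
    per_slide = slide // batch_interval     # mini-batches between results
    results = []
    for end in range(per_slide, len(batch_sizes) + 1, per_slide):
        start = max(0, end - per_window)
        results.append(sum(batch_sizes[start:end]))
    return results

# Made-up load: 1000 records per 5-second batch for 90 seconds, then silence.
batch_sizes = [1000] * 18 + [0] * 12
counts = windowed_counts(batch_sizes)
print(counts)
# [3000, 6000, 9000, 12000, 12000, 12000, 9000, 6000, 3000, 0]
```

The totals grow while the first window fills up, stay constant while data keeps arriving at a steady rate, and shrink once the input stops.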


**Exercise 3.2**  
1. Inspect the contents of the file `3-salesPerMinutes.py`. See if you understand what is being done.
2. Execute this python script. In parallel (i.e., in a separate jupyter notebook/shell/command line), execute `scripts/simulateStreamingInput.py` to start copying data to the `stream-IN` folder. The totals reported should increase during the first minute, and then stabilize. Once stabilized, cancel the `simulateStreamingInput` script; the reported numbers should now start to decrease.
3. You can terminate the Spark Streaming and `simulateStreamingInput` jobs by interrupting Jupyter.

In [20]:
!spark-submit --master "local[*]" 3-salesPerMinutes.py

20/11/18 15:24:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleting existing files in stream-IN ...
... done
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/11/18 15:24:40 INFO SparkContext: Running Spark version 2.4.5
20/11/18 15:24:40 INFO SparkContext: Submitted application: 3-salesPerMinutes.py
20/11/18 15:24:40 INFO SecurityManager: Changing view acls to: bigdata
20/11/18 15:24:40 INFO SecurityManager: Changing modify acls to: bigdata
20/11/18 15:24:40 INFO SecurityManager: Changing view acls groups to: 
20/11/18 15:24:40 INFO SecurityManager: Changing modify acls groups to: 
20/11/18 15:24:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(bigdata); groups with view permissions: Set(); users  with modify permissions: Set(bigdata); groups with modify permissions: Set()
20/11/18 15:24:41 INFO Utils

**Exercise 3.3** Copy `3-salesPerMinutes.py` into a file `3.3-top5Securities.py` and 
modify the latter to compute the top five most traded securities in the last 3 minutes, updated every 10 seconds.

In [None]:
%%file 3.3-top5Securities.py
# Your solution here!

In [None]:
!spark-submit --master "local[*]" 3.3-top5Securities.py

## 4. A Spark Streaming Example that receives input from Kafka and outputs to Kafka

So far, we have been using Spark Streaming to read from the filesystem, and output to the console or the filesystem. In this final exercise, we will run a spark streaming job that consumes input from Kafka. 



We first create a new Kafka topic that will be used to receive stock quotes.

In [21]:
# Create a new Kafka topic that will be used to receive stock quotes
!kafka-topics.sh  --create --zookeeper localhost:2181 \
    --topic $USER.orders --partitions 5 --replication-factor 1

Created topic bigdata.orders.


Next, we create the script that we will use to analyze the kafka topic. Check that you understand what it does.

In [22]:
%%file 4-salesPerMinuteFromKafka.py
import sys
import os
import pwd
from datetime import datetime

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext()
sc.setLogLevel("ERROR")   #Only errors observed by spark are printed (warnings and INFO messages are suppressed).

ssc = StreamingContext(sc, 5)  #generate a mini-batch every 5 seconds
zookeeper = "localhost:2181"
username = pwd.getpwuid( os.getuid() )[ 0 ] 
topic = username + ".orders"
inputStream = KafkaUtils.createStream(ssc, zookeeper,
                                  "raw-event-streaming-consumer", {topic:1})

def parseOrder(line):
  '''parses a single line in the orders file'''
  s = line.strip().split(",")
  try:
      if s[6] != u"B" and s[6] != u"S":
        raise Exception('Wrong format ' + str(s))
      return [{"time": datetime.strptime(s[0], "%Y-%m-%d %H:%M:%S"),
               "orderId": int(s[1]), 
               "clientId": int(s[2]),
               "symbol": s[3], 
               "amount": int(s[4]), 
               "price":  float(s[5]), 
               "buy": s[6] == u"B"}]
  except Exception as err:
      print("Wrong line format (%s): %s" % (line,err))
      return []

from operator import add
orders = inputStream.map(lambda x: x[1]).flatMap(parseOrder)
ordersPerMinute = orders.map(lambda o: 1).window(60, 15) # window length = 60 sec, slide = 15 sec
orderCountPerMinute = ordersPerMinute.reduce(add)
orderCountPerMinute.pprint()

# window operations require checkpointing
sc.setCheckpointDir("checkpoint")

ssc.start()
ssc.awaitTermination()

Writing 4-salesPerMinuteFromKafka.py


The script `scripts/streamOrdersToKafka.py` can be used to send the contents of `data/orders.txt` to the `$USER.orders` Kafka topic, line by line, with 1 line published every 0.5 seconds. Execute this script in a separate notebook.

In parallel, execute the following cell to run the spark streaming job.

In [None]:
# We need to specify `--packages org.apache....` so that the driver required to connect to kafka from python
# is automatically downloaded if we don't already have it.
!spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4 4-salesPerMinuteFromKafka.py