## Steven Miller
### DSC 650 Winter 2020
### 2020-01-27

#### 8.2 Programming Exercise: Streaming Data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('Exercise8').getOrCreate()

**1. Stream Directory Data**

In the first part of the exercise, you will create a simple Spark streaming program that reads an input stream from a file source. The file source stream reader reads data from a directory on a file system. When a new file is added to the directory, Spark adds that file’s data to the input data stream.

You can find the input data for this exercise in the baby-names/streaming directory. This directory contains the baby names CSV file, with its rows randomized and split into 98 individual files. You will use these files to simulate incoming streaming data.

**a. Count the Number of Females**

In this step, you will create a Spark program that monitors an incoming directory. To simulate streaming data, you will copy CSV files from the baby-names/streaming directory into the incoming directory. Because streaming file sources do not infer a schema by default, you will need to define a schema for the CSV data before you initialize the streaming DataFrame.

From this input data stream, you will create a simple output data stream that counts the number of females and writes it to the console. Approximately every 10 seconds, copy a new file into the directory and report the console output. Do this for the first ten files.
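The copy step can be scripted instead of done by hand. The following is a minimal sketch, not part of the exercise: the helper name `feed_files`, its parameters, and the example paths are hypothetical, and it assumes the source files end in `.csv` and sort into the intended order by name. It also assumes the stream reader skips dot-prefixed files, which is why each file is copied under a temporary name and then renamed.

```python
import shutil
import time
from pathlib import Path


def feed_files(source_dir, incoming_dir, limit=10, delay_seconds=10.0):
    """Copy the first `limit` CSV files into `incoming_dir`, pausing between copies.

    Hypothetical helper for illustration; returns the destination paths in copy order.
    """
    source = Path(source_dir)
    incoming = Path(incoming_dir)
    incoming.mkdir(parents=True, exist_ok=True)

    copied = []
    for csv_file in sorted(source.glob("*.csv"))[:limit]:
        # Copy under a dot-prefixed temporary name, then rename, so the final
        # file appears in one step (assumption: the reader skips dot-files).
        temp_path = incoming / ("." + csv_file.name + ".tmp")
        final_path = incoming / csv_file.name
        shutil.copyfile(csv_file, temp_path)
        temp_path.replace(final_path)
        copied.append(final_path)
        time.sleep(delay_seconds)
    return copied


# Example call (paths are assumptions about the local layout):
# feed_files("baby-names/streaming", "streaming", limit=10, delay_seconds=10)
```

Running the helper in a separate terminal or thread leaves the notebook free to query the running stream between copies.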

In [2]:
dataSchema = StructType([StructField('state',StringType(),True),
                          StructField('sex',StringType(),True),
                          StructField('year',IntegerType(),True),
                          StructField('name',StringType(),True),
                          StructField('count',IntegerType(),True)])

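# Read the directory as a stream, processing at most one new file per trigger
streaming = (spark.readStream.schema(dataSchema)
             .option("maxFilesPerTrigger", 1)
             .csv('streaming/', header=True))

# Running count of rows for each value of sex
counts = streaming.groupBy("sex").count()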

In [3]:
# Keep the full aggregate in an in-memory table named "counts" for spark.sql queries
query = counts.writeStream.queryName("counts").format("memory").outputMode("complete").start()

Initial Query of Empty Directory

In [4]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+-----+
|sex|count|
+---+-----+
+---+-----+



After File 00

In [6]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+-----+
|sex|count|
+---+-----+
|  F|33417|
+---+-----+



After File 01

In [7]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+-----+
|sex|count|
+---+-----+
|  F|67008|
+---+-----+



After File 02

In [8]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|100361|
+---+------+



After File 03

In [9]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|133784|
+---+------+



After File 04

In [10]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|167264|
+---+------+



After File 05

In [11]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|200789|
+---+------+



After File 06

In [12]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|234184|
+---+------+



After File 07

In [13]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|267752|
+---+------+



After File 08

In [14]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|301272|
+---+------+



After File 09

In [15]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|334723|
+---+------+



**2. Micro-Batching**

Repeat the last step, but use a micro-batch interval to trigger the processing every 30 seconds. Approximately every 10 seconds, copy a new file into the directory and report the console output. Do this for the first ten files. How did the output differ from the previous example?
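One way to reason about what to expect is to model the timing. The sketch below is a simplified model in plain Python, not Spark code. The function name `files_per_trigger` is hypothetical, and the model assumes files land exactly every 10 seconds, triggers fire exactly every 30 seconds, no per-trigger file limit is set, and a file arriving at a trigger instant is included in that batch.

```python
def files_per_trigger(n_files, arrival_interval, trigger_interval):
    """Return how many files each trigger picks up under the simplified model."""
    if arrival_interval <= 0 or trigger_interval <= 0:
        raise ValueError("intervals must be positive")

    # File i (0-based) arrives at time arrival_interval * (i + 1).
    arrivals = [arrival_interval * (i + 1) for i in range(n_files)]

    batches = []
    processed = 0
    trigger_time = trigger_interval
    while processed < n_files:
        # Everything that has arrived by this trigger and is not yet processed.
        ready = sum(1 for t in arrivals if t <= trigger_time) - processed
        batches.append(ready)
        processed += ready
        trigger_time += trigger_interval
    return batches


print(files_per_trigger(10, 10, 30))  # [3, 3, 3, 1]
```

Under these assumptions, ten files arriving 10 seconds apart are consumed in four batches rather than ten, so the running count should update less often and jump by several files at a time.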

In [7]:
dataSchema = StructType([StructField('state',StringType(),True),
                          StructField('sex',StringType(),True),
                          StructField('year',IntegerType(),True),
                          StructField('name',StringType(),True),
                          StructField('count',IntegerType(),True)])

# No maxFilesPerTrigger here: each trigger picks up every file that has arrived
micro_batch = spark.readStream.schema(dataSchema).csv('streaming/', header=True)
counts = micro_batch.groupBy("sex").count()

# Start a new micro-batch at most once every 30 seconds
microbatch_writer = (counts.writeStream
                     .trigger(processingTime='30 seconds')
                     .queryName("counts").format("memory")
                     .outputMode("complete").start())
microbatch_writer.isActive

True

Initialization

In [8]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+-----+
|sex|count|
+---+-----+
+---+-----+



30 seconds

In [9]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|100361|
+---+------+



60 seconds

In [10]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|200789|
+---+------+



90 seconds

In [11]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|301272|
+---+------+



120 seconds

In [12]:
spark.sql("SELECT * FROM counts WHERE sex='F'").show()

+---+------+
|sex| count|
+---+------+
|  F|334723|
+---+------+
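Comparing the two runs answers the question posed in the micro-batching prompt. The sketch below uses only the counts reported in this notebook; the variable names are hypothetical. It recovers how many `sex='F'` rows each file contributed in the first run, then finds which of those cumulative totals each micro-batch result coincides with.

```python
# Cumulative counts of sex='F' rows after files 00..09 (one file per trigger).
per_file_totals = [33417, 67008, 100361, 133784, 167264,
                   200789, 234184, 267752, 301272, 334723]

# Counts reported at 30, 60, 90, and 120 seconds (30-second trigger).
micro_batch_totals = [100361, 200789, 301272, 334723]

# Rows contributed by each individual file.
rows_per_file = [b - a for a, b in zip([0] + per_file_totals, per_file_totals)]

# Index of the file whose cumulative total matches each micro-batch result.
matched_files = [per_file_totals.index(total) for total in micro_batch_totals]

# Number of files absorbed by each micro-batch.
files_per_batch = [b - a for a, b in zip([-1] + matched_files, matched_files)]

print(rows_per_file)    # [33417, 33591, 33353, 33423, 33480, 33525, 33395, 33568, 33520, 33451]
print(matched_files)    # [2, 5, 8, 9]
print(files_per_batch)  # [3, 3, 3, 1]
```

Each 30-second result equals the total that the per-file run reached after files 02, 05, 08, and 09. This is consistent with each trigger processing about three files at once: the final count is the same in both runs, but the intermediate results appear less often and grow in larger steps.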

