===========================================


Title: 7.2 Exercises


Author: Chad Wood


Date: 6 Feb 2022


Modified By: Chad Wood


Description: This program demonstrates using Spark to maintain a running word count of text streamed over a TCP socket.

=========================================== 

# Stream Processing with Spark

## Assignment 7

### Assignment 7.1

In this assignment, you will run the [quick example from Apache Spark's structured streaming programming guide](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example). 
The first step involves opening a terminal and running the netcat command. You must start this command before running the Spark code. You can open a terminal in JupyterLab by selecting *File -> New -> Terminal* from the menu. Start the netcat program running on port 9999 by using the following command. 

```
nc -lk 9999
```
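
If netcat is not available, or you want to feed the same lines repeatedly, a small Python listener can stand in for it. This is only a sketch: `open_listener` and `send_lines` are hypothetical helper names, not part of the assignment. It relies on Spark's socket source connecting as a client and reading newline-terminated UTF-8 text. Unlike `nc -lk`, it closes the connection once the lines are sent.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Bind and listen on host:port, as `nc -lk 9999` does (port=0 picks a free port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for one client (for example the Spark socket source) and send each line."""
    conn, _addr = server.accept()
    with conn:
        for line in lines:
            # Each line is newline-terminated and UTF-8 encoded
            conn.sendall((line + "\n").encode("utf-8"))


# Example call (blocks until a client connects):
# send_lines(open_listener(), ["hello world", "hello again"])
```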

With netcat still running in the terminal, come back to this notebook and run the following code. 

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split


spark = SparkSession \
    .builder \
    .appName("Assignment 7.1") \
    .getOrCreate()

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
   explode(
       split(lines.value, " ")
   ).alias("word")
)

# Generate running word count (no filter yet; the count > 2 filter is added in Assignment 7.2)
wordCounts = words.groupBy("word").count()

In [None]:
try:
    query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()

    query.awaitTermination()
except KeyboardInterrupt:
    # Stop the streaming query so it does not keep running in the background
    print('Stopping query')
    query.stop()

Now you can go back to the terminal and enter text values. You can input any text that you like. You can use [The Project Gutenberg EBook of The Odyssey, by Homer](https://www.gutenberg.org/files/1727/1727-0.txt) if you can't think of any sample text. After entering the text in the terminal, you should see output within this notebook. 

<i>I got this to work; however, it doesn't appear to be possible to display the result in the notebook. 
I reached this conclusion after referring to these forum threads:
https://stackoverflow.com/questions/55840424/spark-streaming-awaittermination-in-jupyter-notebook
https://stackoverflow.com/questions/61463554/structured-streaming-output-is-not-showing-on-jupyter-notebook
    
    Please let me know if I'm mistaken about this.
</i>
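
One possible workaround, which I have not tested in this notebook, is Spark's `memory` sink: it keeps the result in an in-memory table named by `queryName`, and that table can then be queried from a notebook cell. The helper names `start_in_memory` and `show_counts` and the table name `word_counts` are hypothetical; the sketch assumes `wordCounts` and `spark` are the objects defined above.

```python
def start_in_memory(word_counts, table_name="word_counts"):
    """Start the streaming query with the memory sink instead of the console sink."""
    return (
        word_counts.writeStream
        .outputMode("complete")   # keep the full running counts in the table
        .format("memory")         # results go to an in-memory table
        .queryName(table_name)    # the table is named after the query
        .start()
    )


def show_counts(spark, table_name="word_counts"):
    """Display the current contents of the in-memory table in the notebook."""
    spark.sql(f"SELECT * FROM {table_name}").show()


# Example use in a notebook (do not call awaitTermination, so the cell returns):
# query = start_in_memory(wordCounts)
# show_counts(spark)   # re-run this cell after typing text into netcat
# query.stop()
```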

One of my outputs from the terminal:
```
-------------------------------------------
Batch: 1
-------------------------------------------
+--------+-----+
|    word|count|
+--------+-----+
|  online|    1|
|   parts|    1|
|  Title:|    1|
|      If|    1|
|  anyone|    1|
|    copy|    1|
|     not|    1|
|    will|    1|
|      by|    1|
|   using|    1|
|     you|    3|
|     for|    1|
|   under|    1|
|      in|    2|
|  States|    1|
|   Homer|    1|
|Odyssey,|    1|
|    with|    2|
|   terms|    1|
| Odyssey|    1|
+--------+-----+
only showing top 20 rows
```

### Assignment 7.2

Modify the word count query so that the streaming query only returns results where the word count is greater than two. Test your code by typing repeated words into the netcat program. 

In [None]:
# Generate running word count where count > 2
wordCounts = words.groupBy("word").count().where('count > 2')

try:
    query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()

    query.awaitTermination()
except KeyboardInterrupt:
    # Stop the streaming query so it does not keep running in the background
    print('Stopping query')
    query.stop()

netcat input: <i>here
    is multiple words for you to see that multiple words are possible to establish in the words count for count</i>
    
output:
```
-------------------------------------------
Batch: 0
-------------------------------------------
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
|words|    3|
+-----+-----+
```
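
The batches above can be traced without Spark. The sketch below is plain Python, not the Spark API: it mimics `split`, `explode`, `groupBy("word").count()` and `where('count > 2')` with a `Counter`, and it assumes the netcat input arrived as two lines ("here", then the rest). `simulate_batches` and `threshold` are hypothetical names used only for illustration.

```python
from collections import Counter


def simulate_batches(lines, threshold=2):
    """Return, for each input line, the running counts that exceed the threshold."""
    totals = Counter()
    results = []
    for line in lines:
        # split(lines.value, " ") followed by explode: one entry per word
        totals.update(line.split(" "))
        # complete output mode with where('count > 2'): filter the full running table
        results.append({w: c for w, c in totals.items() if c > threshold})
    return results


netcat_lines = [
    "here",
    "is multiple words for you to see that multiple words are possible "
    "to establish in the words count for count",
]

for i, rows in enumerate(simulate_batches(netcat_lines), start=1):
    print(f"Batch {i}: {rows}")
# Batch 1: {}
# Batch 2: {'words': 3}
```

Batch 0 in the Spark output is presumably the empty batch produced before any text was entered, so it has no counterpart in the sketch.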