# Part 3: Streaming Tweet Sentiment Prediction

Work on this part after you have completed part 1 (model training and persistence) and part 2 (receiving Twitter data and writing it to a directory).

## Step 1: Within one cell, write a streaming application

- Load the saved pipeline model as `pipelineModel` as part of the streaming application.
- Load streaming data from the directory `file:/databricks/driver/tweets`.
- Use the pipeline model to transform the streaming DataFrame.
- Drop the unwanted intermediate columns from the output, keeping only `text`, `time`, and `prediction`.
- Write the output to a `memory` sink (`scored_tweets`) in append mode.
- Trigger output at `2 seconds` intervals.

In [0]:
from pyspark.ml import PipelineModel

modelPath = "/FileStore/twitter_nbpipeline"
inputPath = "file:/databricks/driver/tweets"

# Load the pipeline model persisted in part 1
pipelineModel = PipelineModel.load(modelPath)

# Input source: stream JSON files from the tweets directory
streamingInputDF = spark.readStream.schema("time timestamp, text string").json(inputPath)

# Processing logic: score each incoming tweet with the pipeline model
scored_tweets = pipelineModel.transform(streamingInputDF)

# Sink: drop intermediate columns and append to an in-memory table every 2 seconds
query = (
    scored_tweets
    .drop("rawPrediction", "probability", "features", "words", "words_filtered")
    .writeStream
    .trigger(processingTime="2 seconds")
    .format("memory")
    .queryName("scored_tweets")
    .outputMode("append")
    .start()
)
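
Once the stream is started, it can be useful to check that the query is actually running before querying the sink. The following is a minimal sketch, not part of the original notebook: `summarize_query` is a hypothetical helper, and it relies only on the `name`, `isActive`, and `lastProgress` attributes of the handle returned by `start()`. A stand-in object is used so the snippet runs without a cluster; in the notebook, `query` would be passed instead.

```python
from types import SimpleNamespace


def summarize_query(query):
    """Return a small status summary for a streaming query handle (hypothetical helper)."""
    progress = query.lastProgress  # None until the first micro-batch has completed
    return {
        "name": query.name,
        "active": query.isActive,
        "rows_in_last_batch": progress["numInputRows"] if progress else None,
    }


# Stand-in for the real handle (assumption: made-up values, for illustration only).
fake_query = SimpleNamespace(
    name="scored_tweets",
    isActive=True,
    lastProgress={"numInputRows": 42},
)
print(summarize_query(fake_query))
```

When the stream is no longer needed, calling `query.stop()` ends it; the `scored_tweets` in-memory table stops receiving new rows at that point.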

In [0]:
from pyspark.sql.functions import *
#from pyspark.ml 

## Step 2: View the stream results

- Query the number of rows in the `scored_tweets` table.
- Visualize the counts of positive and negative tweets in 30-second windows.

In [0]:
%sql 
select count(*) from scored_tweets;

count(1)
5210


In [0]:
%sql
select
  sum(if(prediction = 1, 1, 0)) as positive,
  sum(if(prediction = 0, 1, 0)) as negative,
  window(time, "30 seconds")
from scored_tweets
group by window(time, "30 seconds");

positive,negative,window
69,54,"List(2022-11-05T17:54:00.000+0000, 2022-11-05T17:54:30.000+0000)"
61,70,"List(2022-11-05T17:47:30.000+0000, 2022-11-05T17:48:00.000+0000)"
37,23,"List(2022-11-05T17:55:30.000+0000, 2022-11-05T17:56:00.000+0000)"
86,71,"List(2022-11-05T17:42:00.000+0000, 2022-11-05T17:42:30.000+0000)"
77,63,"List(2022-11-05T17:46:00.000+0000, 2022-11-05T17:46:30.000+0000)"
73,60,"List(2022-11-05T17:53:30.000+0000, 2022-11-05T17:54:00.000+0000)"
79,71,"List(2022-11-05T17:48:00.000+0000, 2022-11-05T17:48:30.000+0000)"
74,73,"List(2022-11-05T17:38:30.000+0000, 2022-11-05T17:39:00.000+0000)"
66,58,"List(2022-11-05T17:44:00.000+0000, 2022-11-05T17:44:30.000+0000)"
72,69,"List(2022-11-05T17:49:30.000+0000, 2022-11-05T17:50:00.000+0000)"
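
Each `window` value in the output above is a half-open interval `[start, end)` whose start falls on a 30-second boundary. As a rough illustration of how a tweet's `time` maps to such a bucket, here is a plain-Python sketch. It assumes tumbling windows aligned to the Unix epoch (the default when no start offset is given); `tumbling_window` is a hypothetical helper, and the sample timestamp is made up rather than taken from the stream.

```python
from datetime import datetime, timedelta, timezone


def tumbling_window(ts, size_seconds=30):
    """Return the (start, end) of the epoch-aligned tumbling window containing ts."""
    # Floor the epoch seconds down to the nearest multiple of the window size.
    start_seconds = int(ts.timestamp()) // size_seconds * size_seconds
    start = datetime.fromtimestamp(start_seconds, tz=timezone.utc)
    return start, start + timedelta(seconds=size_seconds)


# Hypothetical tweet time, chosen to fall inside one of the windows listed above.
sample = datetime(2022, 11, 5, 17, 54, 12, tzinfo=timezone.utc)
start, end = tumbling_window(sample)
print(start.isoformat(), end.isoformat())
```

A timestamp that lands exactly on a boundary, such as 17:54:30, belongs to the window that starts at that instant, not the one that ends there.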

