# Generating the input data for the realtime dashboard

In this exercise, we'll use Spark Structured Streaming to read the cleaned click events from Kafka, aggregate them, and generate the input data for the realtime dashboard.

In [1]:
%%bash
# Install the required Python 3 dependencies
python3 -m pip install kafka-python



Create a Spark context and tell PySpark to load the Spark-Kafka connector package (`spark-sql-kafka-0-10`). `PYSPARK_SUBMIT_ARGS` must be set before the context is created.

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 pyspark-shell'

import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
# The wildcard imports provide StructType, from_json, col, window, etc. used below.
from pyspark.sql.types import *
from pyspark.sql.functions import *

sc = SparkContext()
sc.setLogLevel("WARN")
spark = SparkSession(sc)

Create a streaming DataFrame that represents the events received from the Kafka topic `clicks-cleaned`.

In [None]:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers","10.10.139.63:9092") \
    .option("subscribe", "clicks-cleaned") \
    .option("startingOffsets", "earliest") \
    .load()

Parse the JSON payload of each message into DataFrame columns. Make sure to use `TimestampType` for `ts_ingest`, since we already converted it in the `clean` notebook.

In [None]:
schema = StructType([
    StructField("visitor_platform", StringType()),
    StructField("ts_ingest", TimestampType()),
    StructField("article_title", StringType()),
    StructField("visitor_country", StringType()),
    StructField("visitor_os", StringType()),
    StructField("article", StringType()),
    StructField("visitor_browser", StringType()),
    StructField("visitor_page_timer", IntegerType()),
    StructField("visitor_page_height", IntegerType()),
])

print(df.schema)

dfs = df.selectExpr("CAST(value AS STRING)") \
      .select(from_json(col("value"), schema).alias("clicks"))

df_data = dfs.select("clicks.*")
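
To make the parsing step more concrete, here is a plain-Python sketch of roughly what `from_json` does with the schema above: decode the message value, pick out the known fields, and coerce each one to its type. The sample message and the ISO timestamp format are assumptions made for illustration; the real `clicks-cleaned` payload may look different, and Spark's handling of malformed records is not modelled here.

```python
import json
from datetime import datetime

# Field names are taken from the schema above. The converters are plain-Python
# stand-ins for StringType, TimestampType and IntegerType.
FIELDS = {
    "visitor_platform": str,
    "ts_ingest": datetime.fromisoformat,  # assumes an ISO 8601 string
    "article_title": str,
    "visitor_country": str,
    "visitor_os": str,
    "article": str,
    "visitor_browser": str,
    "visitor_page_timer": int,
    "visitor_page_height": int,
}


def parse_click(raw):
    """Decode one Kafka message value; fields missing from the JSON become None."""
    record = json.loads(raw.decode("utf-8"))
    row = {}
    for name, convert in FIELDS.items():
        value = record.get(name)
        row[name] = convert(value) if value is not None else None
    return row


# Hypothetical message, made up for this example.
sample = json.dumps({
    "article_title": "Article A",
    "ts_ingest": "2019-06-01T12:00:03",
    "visitor_page_timer": 42,
}).encode("utf-8")

row = parse_click(sample)
print(row["article_title"], row["ts_ingest"], row["visitor_os"])
```

Fields that are absent from the message end up as `None`, which mirrors the null columns you would see in `df_data` for incomplete events.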


Generate the values you want to show in your dashboard. You are free to choose which values and aggregations to show. As an example, you can group by article title and use a 10-second window to show how many views each article received.

In [None]:
df_data_grouped = (
    df_data
        # Optional: limit how long late events are accepted (event-time column is ts_ingest).
        # .withWatermark("ts_ingest", "20 seconds")
        .groupBy(
            df_data['article_title'],
            window(df_data['ts_ingest'], "10 seconds"))
        .count()
)
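
If the windowed `groupBy` feels abstract, the following plain-Python sketch shows what a 10-second tumbling window does to a handful of events: each timestamp is mapped to the start of its window (aligned to the Unix epoch, which is how Spark aligns windows by default), and views are counted per article and window. The sample events are made up, and the helper names `window_start` and `count_views` are hypothetical; this is not how Spark implements it internally.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=10)
EPOCH = datetime(1970, 1, 1)


def window_start(ts, size=WINDOW):
    """Return the start of the tumbling window [start, start + size) containing ts."""
    return EPOCH + ((ts - EPOCH) // size) * size


def count_views(events, size=WINDOW):
    """Count events per (article_title, window_start, window_end)."""
    counts = Counter()
    for title, ts in events:
        start = window_start(ts, size)
        counts[(title, start, start + size)] += 1
    return counts


# Hypothetical (article_title, ts_ingest) pairs.
events = [
    ("Article A", datetime(2019, 6, 1, 12, 0, 3)),
    ("Article A", datetime(2019, 6, 1, 12, 0, 9)),
    ("Article A", datetime(2019, 6, 1, 12, 0, 10)),  # falls into the next window
    ("Article B", datetime(2019, 6, 1, 12, 0, 4)),
]

counts = count_views(events)
for (title, start, end), n in sorted(counts.items()):
    print(title, start.time(), end.time(), n)
```

Note that the window end is exclusive: the event at 12:00:10 belongs to the window starting at 12:00:10, not to the one starting at 12:00:00.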

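Finally, start the continuous query. The cell below prints the results to the console, which is handy for debugging; for the dashboard itself, write the output to Kafka topics of your choosing.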

In [None]:
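# Debug in the terminal by printing each result batch to the console.
# Docs on output modes: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
# Valid modes are "append", "complete" and "update"; "update" prints only the rows whose counts changed.
query = (
    df_data_grouped.writeStream
        .outputMode("update")
        .option("truncate", "false")
        .format("console")
        .start()
)
query.awaitTermination()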
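
The cell above only prints to the console. One possible way to send the same aggregates to Kafka is sketched below. It assumes the grouped DataFrame has an `article_title` column to use as the message key, serializes every row to JSON as the message value, and relies on the Kafka sink's requirement for a checkpoint location. The helper names, the topic `clicks-aggregated` and the checkpoint path are hypothetical, so adapt them to your setup.

```python
def kafka_sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Collect the options the Kafka sink needs into one dict."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        # The Kafka sink needs a checkpoint directory to track its progress.
        "checkpointLocation": checkpoint_dir,
    }


def write_counts_to_kafka(df_grouped, bootstrap_servers, topic, checkpoint_dir):
    """Start a streaming query that publishes each aggregated row as a JSON message."""
    # The Kafka sink expects a `value` column and optionally a `key` column.
    payload = df_grouped.selectExpr(
        "CAST(article_title AS STRING) AS key",
        "to_json(struct(*)) AS value",
    )
    return (
        payload.writeStream
        .format("kafka")
        .options(**kafka_sink_options(bootstrap_servers, topic, checkpoint_dir))
        .outputMode("update")
        .start()
    )


# Example call (topic and checkpoint path are placeholders):
# query = write_counts_to_kafka(
#     df_data_grouped,
#     "10.10.139.63:9092",
#     "clicks-aggregated",
#     "/tmp/checkpoints/clicks-aggregated",
# )
# query.awaitTermination()
```

Keying the messages by article title keeps all updates for one article in the same partition, which can simplify things for the dashboard consumer. Whether that matters depends on how the dashboard reads the topic.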