# Generating the input data for the realtime dashboard

In this exercise, we'll use Spark Structured Streaming to generate the input data for the realtime dashboard.

In [1]:
%%bash
# Install the required Python 3 dependencies
python3 -m pip install kafka-python



In [2]:

from time import sleep
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 pyspark-shell'

from IPython.display import display, clear_output

import pyspark 
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *


def test_query(sdf, mode="append", rows=None, wait=2, sort=None):
    # If a query with the same name exists, stop it.
    query_name = "test_query"
    query = None
    for q in spark.streams.active:
        if (q.name == query_name):
            query = q
    if query is not None:
        query.stop()

    try:
        tq = (
            # Create an output stream
            sdf.writeStream               
            # Output mode, e.g. "append" to write only new rows to the output
            .outputMode(mode)           
            # Write output stream to an in-memory Spark table (a DataFrame)
            .format("memory")               
            # The name of the output table will be the same as the name of the query
            .queryName(query_name)
            # Submit the query to Spark and execute it
            .start()
        )

        tq.processAllAvailable()

        sleep(wait)
        while tq.status.get("isTriggerActive"):
            print(f"DataAvailable: {tq.status['isDataAvailable']},\tTriggerActive: {tq.status['isTriggerActive']}\t{tq.status['message']}")
            sleep(wait)

        # When the status says "Waiting for data to arrive", that means the query
        # has finished its current iteration and is waiting for new messages from
        # Kafka.
        print(f"DataAvailable: {tq.status['isDataAvailable']},\tTriggerActive: {tq.status['isTriggerActive']}\t{tq.status['message']}")

        memory_sink = spark.table(query_name)

        if sort:
            memory_sink = memory_sink.sort(*sort)

        # Show the result table in the notebook. Jupyter renders pandas tables natively,
        # so convert the Spark DataFrame (or only its first `rows` rows) to pandas.
        display(memory_sink)
        if rows:
            display(memory_sink.limit(rows).toPandas())
        else:
            display(memory_sink.toPandas())

    finally:
        # Always try to stop the query; ignore failures (including the case
        # where the query never started and `tq` is undefined).
        try:
            tq.stop()
        except Exception:
            pass


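Create a Spark session. The Spark-Kafka connector package is requested through the `PYSPARK_SUBMIT_ARGS` environment variable set above, so Spark resolves and downloads it when the session starts.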

In [4]:
# Create a local Spark session with two worker threads (if one doesn't already exist)
spark = SparkSession.builder.master('local[2]').getOrCreate()

Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-67178843-b4fd-4d1c-8194-878ed34780c6;1.0
	confs: [default]


:: loading settings :: url = jar:file:/usr/local/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central
	found org.apache.kafka#kafka-clients;2.8.0 in central
	found org.lz4#lz4-java;1.7.1 in central
	found org.xerial.snappy#snappy-java;1.1.8.4 in central
	found org.slf4j#slf4j-api;1.7.30 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.hadoop#hadoop-client-api;3.3.1 in central
	found org.apache.htrace#htrace-core4;4.1.0-incubating in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.apache.commons#commons-pool2;2.6.2 in central
:: resolution report :: resolve 381ms :: artifacts dl 7ms
	:: modules in use:
	com.google.code.findbugs#jsr305;3.0.0 from central in [default]
	commons-logging#commons-logging;1.1.3 from central in [default]
	org.apache.commons#commons-pool2;2.
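
The connector coordinate has to match the Spark installation: the log above shows a Scala 2.12 build of Spark 3.2.0. As a minimal sketch, the coordinate can be assembled from those two versions instead of being hard-coded. The helper name `kafka_connector_package` is hypothetical and not part of the notebook or of PySpark.

```python
import os


def kafka_connector_package(scala_version: str, spark_version: str) -> str:
    """Build the Maven coordinate of the Spark-Kafka connector (hypothetical helper)."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


# Versions taken from the recorded log: Scala 2.12 build, Spark 3.2.0.
package = kafka_connector_package("2.12", "3.2.0")

# Set this before the first SparkSession is created; the trailing
# "pyspark-shell" is kept exactly as in the import cell.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {package} pyspark-shell"

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

If the versions do not match the installed Spark, dependency resolution or the first Kafka read is likely to fail, so it is worth checking them against the Spark home directory.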

Create a streaming DataFrame that represents the events received from the Kafka topic `clicks-cleaned`.

In [5]:
input = (
    spark.readStream.format("kafka")
    # The Kafka server is available on localhost port 9092
    .option("kafka.bootstrap.servers","localhost:9092")
    # Read the "clicks-cleaned" topic
    .option("subscribe", "clicks-cleaned")
    # Start at the beginning of this topic. This will read all historical data from Kafka.
    # Use "latest" if you only want to process _new_ events.
    .option("startingOffsets", "earliest")
    # Process a maximum of 5 offsets per trigger
    .option("maxOffsetsPerTrigger", "5")
    # Return a Streaming DataFrame representing this stream
    .load()
)

test_query(input, mode="append")
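
To get a feel for what `maxOffsetsPerTrigger` does, here is a plain-Python sketch of the arithmetic, assuming a single partition holding offsets 0 to 75 (as in the recorded run). The function `micro_batches` is hypothetical; Spark's real planner splits the limit across partitions, and its exact batch boundaries may differ.

```python
def micro_batches(first_offset, last_offset, max_per_trigger):
    """Split an inclusive offset range into chunks of at most max_per_trigger offsets."""
    batches = []
    start = first_offset
    while start <= last_offset:
        end = min(start + max_per_trigger - 1, last_offset)
        batches.append((start, end))
        start = end + 1
    return batches


# 76 offsets, at most 5 per trigger
batches = micro_batches(0, 75, 5)
print(len(batches))               # number of micro-batches needed
print(batches[0], batches[-1])    # first and last offset ranges
```

With a limit this small, reading the whole topic takes many triggers, which is useful for watching the stream progress but slow for large backlogs.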

21/12/13 00:24:23 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-ea34d396-87cf-49cf-86ce-ef173b21bd39. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 00:24:23 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

DataAvailable: False,	TriggerActive: False	Waiting for data to arrive
DataAvailable: False,	TriggerActive: False	Waiting for data to arrive
DataAvailable: False,	TriggerActive: False	Waiting for data to arrive


DataFrame[key: binary, value: binary, topic: string, partition: int, offset: bigint, timestamp: timestamp, timestampType: int]

Unnamed: 0,key,value,topic,partition,offset,timestamp,timestampType
0,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,0,2021-12-12 21:54:42.497,0
1,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,1,2021-12-12 21:54:42.505,0
2,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,2,2021-12-12 21:54:42.506,0
3,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,3,2021-12-12 21:54:42.592,0
4,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,4,2021-12-12 21:54:42.593,0
...,...,...,...,...,...,...,...
72,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,72,2021-12-12 21:54:43.893,0
73,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,73,2021-12-12 21:54:43.893,0
74,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,74,2021-12-12 21:54:43.893,0
75,,"[123, 34, 118, 105, 115, 105, 116, 111, 114, 9...",clicks-cleaned,0,75,2021-12-12 21:54:43.893,0
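
The `value` column is binary, which is why the table shows lists of byte values. A quick sketch shows that the bytes visible above are simply UTF-8 encoded JSON text. The round-trip payload at the end is a made-up example, not a record from the topic.

```python
import json

# The first nine byte values shown in the `value` column above
prefix = bytes([123, 34, 118, 105, 115, 105, 116, 111, 114])
decoded_prefix = prefix.decode("utf-8")
print(decoded_prefix)

# Round trip with a hypothetical payload: dict -> JSON bytes -> dict
payload = {"visitor_platform": "mobile", "visitor_page_timer": 0}
raw = json.dumps(payload).encode("utf-8")
restored = json.loads(raw.decode("utf-8"))
print(restored == payload)
```

This is the reason the next step casts `value` to a string before parsing it.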


Parse the JSON in the `value` column into DataFrame columns. Use `TimestampType` for `ts_ingest`, since it was already converted to a timestamp in the `clean` notebook.

In [6]:
schema = StructType([
    StructField("visitor_platform", StringType()),
    StructField("ts_ingest", TimestampType()),
    StructField("article_title", StringType()),
    StructField("visitor_country", StringType()),
    StructField("visitor_os", StringType()),
    StructField("article", StringType()),
    StructField("visitor_browser", StringType()),
    StructField("visitor_page_timer", IntegerType()),
    StructField("visitor_page_height", IntegerType()),
])

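dfs = (
    input
    # Kafka delivers the value as binary, so cast it to a string first
    .selectExpr("CAST(value AS STRING)")
    # Parse the JSON string into a struct column named "clicks"
    .select(from_json(col("value"), schema).alias("clicks"))
)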

df_data = dfs.select("clicks.*")

test_query(df_data, mode="append")

21/12/13 00:25:40 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-e2d44e4b-b98e-4de2-b9be-2a9fe221622f. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 00:25:40 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


DataAvailable: False,	TriggerActive: False	Waiting for data to arrive


DataFrame[visitor_platform: string, ts_ingest: timestamp, article_title: string, visitor_country: string, visitor_os: string, article: string, visitor_browser: string, visitor_page_timer: int, visitor_page_height: int]

Unnamed: 0,visitor_platform,ts_ingest,article_title,visitor_country,visitor_os,article,visitor_browser,visitor_page_timer,visitor_page_height
0,mobile,2021-12-12 21:29:03,Cercanías San Sebastián,BE,ios,https://en.wikipedia.org/wiki/Cercan%C3%ADas_S...,unknown,0,0
1,mobile,2021-12-12 21:29:03,Kingdom of Hawaii,BE,ios,https://en.wikipedia.org/wiki/Kingdom_of_Hawaii,unknown,0,0
2,desktop,2021-12-12 21:29:03,Republican National Coalition for Life,BE,windows,https://en.wikipedia.org/wiki/Republican_Natio...,firefox,4350,18743
3,desktop,2021-12-12 21:29:03,"Black Mesa (Warm Springs, Arizona)",BE,windows,https://en.wikipedia.org/wiki/Black_Mesa_(Warm...,unknown,1117,5000
4,desktop,2021-12-12 21:29:03,Lavalle House,BE,windows,https://en.wikipedia.org/wiki/Lavalle_House,chrome,1409,39838
...,...,...,...,...,...,...,...,...,...
72,mobile,2021-12-12 21:29:06,Kingdom of Hawaii,BE,ios,https://en.wikipedia.org/wiki/Kingdom_of_Hawaii,unknown,0,0
73,mobile,2021-12-12 21:29:06,Sky (company),BE,ios,https://en.wikipedia.org/wiki/Sky_(company),unknown,0,0
74,tablet,2021-12-12 21:29:06,2010 North African Super Cup,BE,ios,https://en.wikipedia.org/wiki/2010_North_Afric...,safari,12222,4175
75,desktop,2021-12-12 21:29:06,Randomized algorithm,BE,mac,https://en.wikipedia.org/wiki/Randomized_algor...,safari,584,5465
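
To make the parsing step concrete without a running Spark session, the following plain-Python sketch mimics the general shape of projecting a JSON message onto a fixed schema. It is not how `from_json` is implemented, and Spark's own casting and null-handling rules differ in detail. The names `SCHEMA` and `parse_click`, the sample payload, and the ISO format assumed for `ts_ingest` are all assumptions made for illustration.

```python
import json
from datetime import datetime

# Field name -> converter, mirroring the StructType defined above.
SCHEMA = {
    "visitor_platform": str,
    "ts_ingest": datetime.fromisoformat,  # assumed wire format: ISO 8601 string
    "article_title": str,
    "visitor_country": str,
    "visitor_os": str,
    "article": str,
    "visitor_browser": str,
    "visitor_page_timer": int,
    "visitor_page_height": int,
}


def parse_click(raw: bytes):
    """Decode one message and project it onto SCHEMA; return None if it is not a JSON object."""
    try:
        record = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(record, dict):
        return None
    row = {}
    for name, convert in SCHEMA.items():
        value = record.get(name)
        try:
            # Missing or unconvertible fields become None
            row[name] = convert(value) if value is not None else None
        except (TypeError, ValueError):
            row[name] = None
    return row


# Hypothetical message with some fields missing
sample = (
    b'{"visitor_platform": "desktop", "ts_ingest": "2021-12-12T21:29:03", '
    b'"article_title": "Lavalle House", "visitor_page_timer": 1409}'
)
row = parse_click(sample)
print(row["ts_ingest"], row["visitor_page_timer"], row["visitor_country"])
print(parse_click(b"not json"))
```

The point of the sketch is that fields absent from a message simply come out empty, while the schema fixes the column names and types up front.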


Generate the values you want to show in your dashboard. You are free to choose which values and aggregations to use. For example, group by article title over a short time window to count how many views each article received; the cell below uses a 2-second window with a 1-second watermark.

In [8]:
df_data_grouped = (
    df_data
        .withWatermark("ts_ingest", "1 second")
        .groupBy(
            col('article_title'),
            window(col('ts_ingest'), "2 seconds"))
        .count()     
)

test_query(df_data_grouped, mode="append")

21/12/13 00:28:02 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-8881f705-6b72-4abb-9d17-e474a80f4150. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 00:28:02 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

DataAvailable: False,	TriggerActive: False	Waiting for data to arrive


DataFrame[article_title: string, window: struct<start:timestamp,end:timestamp>, count: bigint]

Unnamed: 0,article_title,window,count
0,Cercanías San Sebastián,"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",2
1,To Be a Millionaire,"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
2,A540 road,"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
3,Intersex rights in Uganda,"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
4,Granada Theater (Dallas),"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
5,Republican National Coalition for Life,"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",4
6,Battle of Cuddalore (1758),"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
7,"Black Mesa (Warm Springs, Arizona)","(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
8,1982 Daytona 500,"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
9,Sternbergia candida,"(2021-12-12 21:29:02, 2021-12-12 21:29:04)",1
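
The `window` column above can be reproduced by hand. The sketch below assigns timestamps to 2-second tumbling windows aligned to the Unix epoch and counts events per title and window. It also marks which windows would be final under a 1-second watermark, assuming a window is emitted in append mode once the watermark (latest event time minus the delay) has reached the window's end. The events are made up, the timestamps are treated as UTC, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(seconds=2)
WATERMARK_DELAY = timedelta(seconds=1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_for(ts):
    """Return (start, end) of the tumbling window containing ts; end is exclusive."""
    index = (ts - EPOCH) // WINDOW  # whole windows elapsed since the epoch
    start = EPOCH + index * WINDOW
    return start, start + WINDOW


def at(second):
    """Hypothetical helper: a timestamp on 2021-12-12 at 21:29:<second> UTC."""
    return datetime(2021, 12, 12, 21, 29, second, tzinfo=timezone.utc)


# Made-up events: (article_title, ts_ingest)
events = [
    ("Lavalle House", at(3)),
    ("Kingdom of Hawaii", at(3)),
    ("Kingdom of Hawaii", at(3)),
    ("Kingdom of Hawaii", at(6)),
    ("Sky (company)", at(6)),
]

# Count views per (title, window)
counts = Counter((title, window_for(ts)) for title, ts in events)

# Watermark after all events have been seen
watermark = max(ts for _, ts in events) - WATERMARK_DELAY
final = {key: n for key, n in counts.items() if key[1][1] <= watermark}

for (title, (start, end)), n in sorted(final.items()):
    print(title, start.time(), end.time(), n)
```

Under that assumption only the 21:29:02 to 21:29:04 window is final once events up to 21:29:06 have arrived, which would be consistent with the single window visible in the output above. In `complete` or `update` mode the open windows would be reported as well.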
