# DDoS and Intrusion detection


In this exercise, we'll use Spark structured streaming to detect DDoS attacks and attempts to access the admin panel of the website.

* Use the [fake-ddos](fake-ddos.ipynb) notebook to simulate a DDoS attack.
* Use the [fake-intrusion](fake-intrusion.ipynb) notebook to simulate an intrusion attempt.

First, we'll add the same `test_query` function from the cleanup notebook.

In [1]:

from time import sleep
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 pyspark-shell'

from IPython.display import display, clear_output

import pyspark 
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *


def test_query(sdf, mode="append", rows=None, wait=2, sort=None):
    # If a query with the same name exists, stop it.
    query_name = "test_query"
    query = None
    for q in spark.streams.active:
        if (q.name == query_name):
            query = q
    if query is not None:
        query.stop()

    try:
        tq = (
            # Create an output stream
            sdf.writeStream               
            # Output mode: "append" writes only new rows, "complete" rewrites the whole result table
            .outputMode(mode)           
            # Write output stream to an in-memory Spark table (a DataFrame)
            .format("memory")               
            # The name of the output table will be the same as the name of the query
            .queryName(query_name)
            # Submit the query to Spark and execute it
            .start()
        )

        tq.processAllAvailable()

        sleep(wait)
        while tq.status.get("isTriggerActive"):
            print(f"DataAvailable: {tq.status['isDataAvailable']},\tTriggerActive: {tq.status['isTriggerActive']}\t{tq.status['message']}")
            sleep(wait)

        # When the status says "Waiting for data to arrive", that means the query
        # has finished its current iteration and is waiting for new messages from
        # Kafka.
        print(f"DataAvailable: {tq.status['isDataAvailable']},\tTriggerActive: {tq.status['isTriggerActive']}\t{tq.status['message']}")

        memory_sink = spark.table(query_name)

        if sort:
            memory_sink = memory_sink.sort(*sort)

        # Show the result table in the notebook. Jupyter renders pandas DataFrames natively,
        # so we convert the Spark DataFrame (limited to `rows` rows if that argument is given).
        display(memory_sink)
        if rows:
            display(memory_sink.limit(rows).toPandas())
        else:
            display(memory_sink.toPandas())

    finally:
        # Always try to stop the query, but it doesn't matter if that fails
        # (for example because the query never started).
        try:
            tq.stop()
        except Exception:
            pass


In [2]:
%%bash
# Install the required Python 3 dependencies
python3 -m pip install kafka-python pyarrow



Create a Spark session. The Spark-Kafka connector package was already requested through `PYSPARK_SUBMIT_ARGS` in the first cell, so Spark resolves and loads it when the session starts.

In [3]:
# Create a local Spark session with two worker threads (if it doesn't already exist)
spark = SparkSession.builder.master('local[2]').getOrCreate()



:: loading settings :: url = jar:file:/usr/local/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4e759fdd-8a58-4e60-9c59-de65a72e63dd;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central
	found org.apache.kafka#kafka-clients;2.8.0 in central
	found org.lz4#lz4-java;1.7.1 in central
	found org.xerial.snappy#snappy-java;1.1.8.4 in central
	found org.slf4j#slf4j-api;1.7.30 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.hadoop#hadoop-client-api;3.3.1 in central
	found org.apache.htrace#htrace-core4;4.1.0-incubating in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central

Create a streaming DataFrame that represents the events received from the Kafka topic `clicks-cleaned`.

In [4]:
input = (
    spark.readStream.format("kafka")
    # The Kafka server is available on localhost port 9092
    .option("kafka.bootstrap.servers","localhost:9092")
    # Read the "clicks-cleaned" topic
    .option("subscribe", "clicks-cleaned")
    # Start at the beginning of this topic. This will read all historical data from Kafka.
    # Use "latest" if you only want to process _new_ events.
    .option("startingOffsets", "earliest")
    # Return a Streaming DataFrame representing this stream
    .load()
)    

Parse the JSON value into columns of the DataFrame. Make sure to use `TimestampType` for `ts_ingest`, since it was already converted to a timestamp in the `clean` notebook.

In [5]:
schema = StructType([
    StructField("visitor_platform", StringType()),
    StructField("ts_ingest", TimestampType()),
    StructField("article_title", StringType()),
    StructField("visitor_country", StringType()),
    StructField("visitor_os", StringType()),
    StructField("article", StringType()),
    StructField("visitor_browser", StringType()),
    StructField("visitor_page_timer", IntegerType()),
    StructField("visitor_page_height", IntegerType()),
])

decoded_json_stream = (
    input
    .selectExpr("CAST(value AS STRING)")
    .select(from_json(col("value"), schema).alias("clicks"))
    .select("clicks.*")
)

test_query(decoded_json_stream, mode="append")

21/12/13 14:33:12 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-0504b17a-699f-4087-a0a5-af38d7e931e6. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 14:33:12 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

DataAvailable: False,	TriggerActive: False	Waiting for data to arrive
DataAvailable: False,	TriggerActive: False	Waiting for data to arrive
DataAvailable: False,	TriggerActive: False	Waiting for data to arrive


DataFrame[visitor_platform: string, ts_ingest: timestamp, article_title: string, visitor_country: string, visitor_os: string, article: string, visitor_browser: string, visitor_page_timer: int, visitor_page_height: int]

Unnamed: 0,visitor_platform,ts_ingest,article_title,visitor_country,visitor_os,article,visitor_browser,visitor_page_timer,visitor_page_height
0,mobile,2021-12-13 14:28:46,Cercanías San Sebastián,BE,ios,https://en.wikipedia.org/wiki/Cercan%C3%ADas_S...,unknown,0,0
1,mobile,2021-12-13 14:28:46,Kingdom of Hawaii,BE,ios,https://en.wikipedia.org/wiki/Kingdom_of_Hawaii,unknown,0,0
2,desktop,2021-12-13 14:28:46,Republican National Coalition for Life,BE,windows,https://en.wikipedia.org/wiki/Republican_Natio...,firefox,4350,18743
3,desktop,2021-12-13 14:28:46,"Black Mesa (Warm Springs, Arizona)",BE,windows,https://en.wikipedia.org/wiki/Black_Mesa_(Warm...,unknown,1117,5000
4,mobile,2021-12-13 14:28:46,Cercanías San Sebastián,BE,ios,https://en.wikipedia.org/admin,unknown,0,0
...,...,...,...,...,...,...,...,...,...
1363,mobile,2021-12-13 14:30:47,Robert L. Rutherford,BE,android,https://en.wikipedia.org/wiki/Robert_L._Ruther...,chrome,6864,4860
1364,desktop,2021-12-13 14:30:47,Heliamphora neblinae,BE,windows,https://en.wikipedia.org/wiki/Heliamphora_nebl...,unknown,9359,3966
1365,mobile,2021-12-13 14:30:47,Battle of Cuddalore (1758),BE,android,https://en.wikipedia.org/wiki/Battle_of_Cuddal...,chrome,9863,4162
1366,tablet,2021-12-13 14:30:47,Onnanu Nammal,BE,ios,https://en.wikipedia.org/wiki/Onnanu_Nammal,unknown,0,0


Create a [user-defined function (`udf`)](https://docs.databricks.com/spark/latest/spark-sql/udf-python.html) `forbidden_clicks` which takes a URL as input and returns `True` if the URL points to the admin part of the website (when it ends with `/admin`).

As an example, the following code creates a UDF which squares each value of a column. It is used on the "id" column and the resulting column's name is changed to "id_squared".

```python
from pyspark.sql.functions import udf

@udf("long")
def squared_udf(s):
  return s * s

df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))
```

In [6]:
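# Declare the return type explicitly; a bare @udf returns strings.
@udf("boolean")
def forbidden_clicks(click_url):
    # Null URLs arrive as None; treat them as not forbidden.
    return click_url is not None and click_url.endswith('/admin')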
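
The check inside the UDF is plain Python, so it can be tried out without Spark. The sketch below is a slightly stricter variant than the notebook's `endswith('/admin')`: it compares only the URL path, so a trailing slash or query string does not hide a match. The function name `is_admin_url` and the third example URL are illustrative assumptions, not part of the exercise.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_admin_url(click_url: Optional[str]) -> bool:
    """Return True when the path of the URL ends with '/admin'."""
    if click_url is None:
        # A SQL NULL reaches a Python UDF as None.
        return False
    # Look at the path only and ignore a trailing slash.
    path = urlsplit(click_url).path.rstrip("/")
    return path.endswith("/admin")


print(is_admin_url("https://en.wikipedia.org/admin"))                   # True
print(is_admin_url("https://en.wikipedia.org/wiki/Kingdom_of_Hawaii"))  # False
print(is_admin_url("https://en.wikipedia.org/admin/?next=home"))        # True
print(is_admin_url(None))                                               # False
```

Wrapping a function like this with `udf("boolean")` would give the same kind of column as `forbidden_clicks`.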

Use the UDF to create the DataFrame `df_forbidden`, which contains the column `forbidden` that specifies whether the URL is an admin URL.

In [7]:
# For every article URL, check whether it is a forbidden URL.
# We cannot use map() here: DataFrames have not supported it since Spark 2.0.
# Under the hood, map() would convert the DataFrame to an RDD, which is not allowed in Structured Streaming.
# Only DataFrame or SQL operations can be used; conversions to RDD (or DStream or local collections) are not supported.
# That is why we use a User Defined Function (UDF) to run some Python code on a column.
df_forbidden = (
    decoded_json_stream
    .select('article', forbidden_clicks('article').cast('boolean').alias('forbidden'))
)

test_query(df_forbidden, mode="append")

21/12/13 14:33:35 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-66257e0f-8f52-473b-8da6-e7a3643263dd. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 14:33:35 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
21/12/13 14:33:36 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:33:36 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:33:36 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It

DataAvailable: False,	TriggerActive: False	Waiting for data to arrive


DataFrame[article: string, forbidden: boolean]

Unnamed: 0,article,forbidden
0,https://en.wikipedia.org/wiki/Cercan%C3%ADas_S...,False
1,https://en.wikipedia.org/wiki/Kingdom_of_Hawaii,False
2,https://en.wikipedia.org/wiki/Republican_Natio...,False
3,https://en.wikipedia.org/wiki/Black_Mesa_(Warm...,False
4,https://en.wikipedia.org/admin,True
...,...,...
1363,https://en.wikipedia.org/wiki/Robert_L._Ruther...,False
1364,https://en.wikipedia.org/wiki/Heliamphora_nebl...,False
1365,https://en.wikipedia.org/wiki/Battle_of_Cuddal...,False
1366,https://en.wikipedia.org/wiki/Onnanu_Nammal,False


We'll do the same for detecting DDoS attacks. First we want to flag whether an individual event is suspicious, i.e. whether `visitor_page_timer` and `visitor_page_height` are both `0`. However, this time we'll use a `pandas_udf`.

[Regular Python UDFs have the disadvantage that they operate on one row at a time](https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), causing them to suffer from high serialization and invocation overhead. Pandas UDFs are built on top of Apache Arrow to support high-performance UDFs in Python.

This is the `squared_udf` converted to a pandas UDF.

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def squared_pandas_udf(s):
    return s * s

df = spark.table("test")
display(df.select("id", squared_pandas_udf("id").alias("id_squared")))
```

The regular UDF version works one row at a time: the user-defined function takes a long `s` and returns the result of `s*s` as a long. In the Pandas version, the user-defined function takes a pandas.Series `s` and returns the result of `s*s` as a pandas.Series. Because `s*s` is vectorized on `pandas.Series`, the Pandas version is much faster than the row-at-a-time version.

Note that there are two important requirements when using scalar pandas UDFs:

* The input and output series must have the same size.
* How a column is split into multiple pandas.Series is internal to Spark, and therefore the result of the user-defined function must be independent of the splitting.
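
The second requirement can be illustrated without Spark or pandas. The sketch below uses plain Python lists as stand-ins for the batches that Spark hands to a scalar pandas UDF; `apply_in_batches` and `above_batch_mean` are hypothetical helpers made up for this illustration. An element-wise function gives the same result however the column is split, while a function that looks at the whole batch does not.

```python
from typing import Callable, List, Sequence

BatchFunc = Callable[[Sequence[int], Sequence[int]], List[bool]]


def flag_batch(page_timer: Sequence[int], page_height: Sequence[int]) -> List[bool]:
    # Element-wise: each output depends only on the matching input row.
    return [t == 0 and h == 0 for t, h in zip(page_timer, page_height)]


def above_batch_mean(page_timer: Sequence[int], page_height: Sequence[int]) -> List[bool]:
    # Depends on the whole batch, so the result changes with the batch boundaries.
    mean = sum(page_timer) / len(page_timer)
    return [t > mean for t in page_timer]


def apply_in_batches(func: BatchFunc, timers: Sequence[int],
                     heights: Sequence[int], batch_size: int) -> List[bool]:
    # Mimic Spark splitting a column into several batches of rows.
    out: List[bool] = []
    for start in range(0, len(timers), batch_size):
        end = start + batch_size
        out.extend(func(timers[start:end], heights[start:end]))
    return out


timers = [0, 4350, 1117, 0, 6864, 0]
heights = [0, 18743, 5000, 0, 4860, 0]

flag_whole = apply_in_batches(flag_batch, timers, heights, batch_size=6)
flag_split = apply_in_batches(flag_batch, timers, heights, batch_size=2)
mean_whole = apply_in_batches(above_batch_mean, timers, heights, batch_size=6)
mean_split = apply_in_batches(above_batch_mean, timers, heights, batch_size=2)

print(flag_whole == flag_split)  # True: safe as a scalar pandas UDF
print(mean_whole == mean_split)  # False: result depends on the splitting
```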


In [8]:
# Flag a single event as suspicious when both 'visitor_page_timer' and 'visitor_page_height' are 0.
@pandas_udf('boolean', PandasUDFType.SCALAR)
def ddos_flagged(page_timer, page_height):
    return (page_timer == 0) & (page_height == 0)

# use ddos_flagged to create df_ddos, where all suspicious events are flagged.
df_ddos = (
    decoded_json_stream
    .select("*", ddos_flagged('visitor_page_timer', 'visitor_page_height').alias('flagged'))
)

test_query(df_ddos, mode="append")

21/12/13 14:34:02 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-5036d677-709f-4473-b3c1-f08d0935a96a. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 14:34:02 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
21/12/13 14:34:03 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:34:03 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:34:03 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It

DataAvailable: False,	TriggerActive: True	Waiting for data to arrive


DataFrame[visitor_platform: string, ts_ingest: timestamp, article_title: string, visitor_country: string, visitor_os: string, article: string, visitor_browser: string, visitor_page_timer: int, visitor_page_height: int, flagged: boolean]

Unnamed: 0,visitor_platform,ts_ingest,article_title,visitor_country,visitor_os,article,visitor_browser,visitor_page_timer,visitor_page_height,flagged
0,mobile,2021-12-13 14:28:46,Cercanías San Sebastián,BE,ios,https://en.wikipedia.org/wiki/Cercan%C3%ADas_S...,unknown,0,0,True
1,mobile,2021-12-13 14:28:46,Kingdom of Hawaii,BE,ios,https://en.wikipedia.org/wiki/Kingdom_of_Hawaii,unknown,0,0,True
2,desktop,2021-12-13 14:28:46,Republican National Coalition for Life,BE,windows,https://en.wikipedia.org/wiki/Republican_Natio...,firefox,4350,18743,False
3,desktop,2021-12-13 14:28:46,"Black Mesa (Warm Springs, Arizona)",BE,windows,https://en.wikipedia.org/wiki/Black_Mesa_(Warm...,unknown,1117,5000,False
4,mobile,2021-12-13 14:28:46,Cercanías San Sebastián,BE,ios,https://en.wikipedia.org/admin,unknown,0,0,True
...,...,...,...,...,...,...,...,...,...,...
1363,mobile,2021-12-13 14:30:47,Robert L. Rutherford,BE,android,https://en.wikipedia.org/wiki/Robert_L._Ruther...,chrome,6864,4860,False
1364,desktop,2021-12-13 14:30:47,Heliamphora neblinae,BE,windows,https://en.wikipedia.org/wiki/Heliamphora_nebl...,unknown,9359,3966,False
1365,mobile,2021-12-13 14:30:47,Battle of Cuddalore (1758),BE,android,https://en.wikipedia.org/wiki/Battle_of_Cuddal...,chrome,9863,4162,False
1366,tablet,2021-12-13 14:30:47,Onnanu Nammal,BE,ios,https://en.wikipedia.org/wiki/Onnanu_Nammal,unknown,0,0,True


In the cell above we highlight the use of high-performance User Defined Functions (UDFs) with pandas. For simple use cases such as this one, we could also avoid UDFs altogether and write the following instead:

```python
df_ddos = decoded_json_stream.withColumn('flagged', when((col('visitor_page_timer') == 0) & (col('visitor_page_height') == 0), True).otherwise(False))
```

The second step in detecting a DDoS attack is counting how many suspicious events happen within a certain timeframe. For this, we'll combine `groupBy` with a 30-second `window` based on the `ts_ingest` timestamp.

In [9]:
df_ddos_window = (
    df_ddos
    .groupBy(
        window(col("ts_ingest"), '30 seconds'),
        col("flagged")
    ).count()
)

test_query(df_ddos_window, mode="complete")

21/12/13 14:34:22 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-5d0d28f7-44dc-4faa-b761-1fddc7d8ecf2. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 14:34:22 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
21/12/13 14:34:23 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:34:23 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:34:23 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It

DataAvailable: False,	TriggerActive: True	Getting offsets from KafkaV2[Subscribe[clicks-cleaned]]


DataFrame[window: struct<start:timestamp,end:timestamp>, flagged: boolean, count: bigint]

Unnamed: 0,window,flagged,count
0,"(2021-12-13 14:28:30, 2021-12-13 14:29:00)",True,305
1,"(2021-12-13 14:29:30, 2021-12-13 14:30:00)",True,16
2,"(2021-12-13 14:30:00, 2021-12-13 14:30:30)",False,51
3,"(2021-12-13 14:30:30, 2021-12-13 14:31:00)",False,54
4,"(2021-12-13 14:29:00, 2021-12-13 14:29:30)",False,349
5,"(2021-12-13 14:29:00, 2021-12-13 14:29:30)",True,316
6,"(2021-12-13 14:30:30, 2021-12-13 14:31:00)",True,23
7,"(2021-12-13 14:30:00, 2021-12-13 14:30:30)",True,32
8,"(2021-12-13 14:28:30, 2021-12-13 14:29:00)",False,222
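
To make the grouping concrete, here is a minimal pure-Python sketch of how events fall into tumbling 30-second windows and get counted per `(window, flagged)` pair. It assumes windows are aligned to the Unix epoch, which matches the window boundaries in the output above (`14:28:30`, `14:29:00`, ...). The helper `window_for` and the four sample events are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(seconds=30)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_for(ts: datetime, size: timedelta = WINDOW):
    """Return (start, end) of the tumbling window that contains ts."""
    # Time elapsed since the start of the current window.
    offset = (ts - EPOCH) % size
    start = ts - offset
    return start, start + size


def at(hour: int, minute: int, second: int) -> datetime:
    return datetime(2021, 12, 13, hour, minute, second, tzinfo=timezone.utc)


# (ts_ingest, flagged) pairs standing in for rows of df_ddos.
events = [
    (at(14, 28, 46), True),
    (at(14, 28, 59), False),
    (at(14, 29, 0), True),   # a window start is inclusive
    (at(14, 29, 29), True),
]

# Equivalent of groupBy(window, flagged).count()
counts = Counter((window_for(ts)[0], flagged) for ts, flagged in events)

for (start, flagged), n in sorted(counts.items()):
    print(start.time(), flagged, n)
```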


Notice that for this query, we're using output mode `complete` instead of `append`. In `append` mode a row can never change once it has been written to the result table, so Spark may only emit a window's count when it knows that the window is final. Without further information Spark cannot know when all events of a certain window have been seen: by default it assumes that data can be "late", meaning an earlier event can enter the stream _after_ a later event has entered. In `complete` mode, rows are written to the result table as soon as they become available and are updated when new data arrives.

Although this solution in complete mode works, it will consume more and more RAM over time, because the intermediate results for all windows are kept. Even if those windows ended years ago!

In order to solve this memory issue, you need to define when Spark can assume that it will not receive events from a certain window anymore. This is done using a [`watermark`](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking). With a watermark, you specify how "late" data can be.

For this exercise, you can assume data will not arrive more than 10 seconds late.

* Use [`withWatermark`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withWatermark.html) to add a watermark to the query with a threshold of 10 seconds, using the `ts_ingest` timestamp as event time.
* Run the query using append mode.
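
As a rough mental model of the watermark, the sketch below replays a handful of event times in arrival order. It is a simplification under stated assumptions: the watermark is updated after every event (Spark updates it between micro-batches), a window is treated as final once its end is at or before the watermark, and events for an already final window are dropped. It uses the 10-second threshold from the exercise text and the 30-second window from the previous step; all names and sample times are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DELAY = timedelta(seconds=10)    # watermark threshold
WINDOW = timedelta(seconds=30)   # tumbling window size
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime) -> datetime:
    return ts - (ts - EPOCH) % WINDOW


def run(event_times):
    open_counts = defaultdict(int)  # windows that can still change
    emitted = []                    # (window_start, count), as append mode would output
    dropped = []                    # events that arrived too late
    max_seen = None
    for ts in event_times:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - DELAY
        start = window_start(ts)
        if start + WINDOW <= watermark:
            dropped.append(ts)      # its window is already final
        else:
            open_counts[start] += 1
        # Emit every window whose end the watermark has passed.
        for s in sorted(open_counts):
            if s + WINDOW <= watermark:
                emitted.append((s, open_counts.pop(s)))
    return emitted, dropped


def at(minute: int, second: int) -> datetime:
    return datetime(2021, 12, 13, 14, minute, second)


arrivals = [
    at(28, 46),
    at(28, 55),
    at(29, 5),
    at(28, 58),   # late, but within 10 seconds: still counted
    at(29, 12),   # watermark moves to 14:29:02, window 14:28:30 becomes final
    at(28, 59),   # too late: dropped
]

emitted, dropped = run(arrivals)
print(emitted)
print(dropped)
```

In this model the window starting at 14:28:30 is emitted once with a count of 3, and the event at 14:28:59 is dropped because it arrives after the watermark has passed the end of its window.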


In [10]:
df_ddos_window_watermark = (
    df_ddos
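    # Note: this run uses a 2-second watermark and 2-second windows, which is what the
    # output below reflects; pass "10 seconds" here for the threshold stated in the exercise.
    .withWatermark("ts_ingest", "2 seconds")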
    .groupBy(
        window(col("ts_ingest"), '2 seconds'),
        col("flagged")
    ).count()
)

test_query(df_ddos_window_watermark, mode="append")

21/12/13 14:36:39 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-ce563fa9-0b00-45f2-a87b-3306ea551406. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
21/12/13 14:36:39 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
21/12/13 14:36:40 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:36:40 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:36:40 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It

DataAvailable: False,	TriggerActive: False	Waiting for data to arrive
DataAvailable: False,	TriggerActive: False	Waiting for data to arrive


DataFrame[window: struct<start:timestamp,end:timestamp>, flagged: boolean, count: bigint]

Unnamed: 0,window,flagged,count
0,"(2021-12-13 14:29:12, 2021-12-13 14:29:14)",False,5
1,"(2021-12-13 14:29:30, 2021-12-13 14:29:32)",True,7
2,"(2021-12-13 14:29:32, 2021-12-13 14:29:34)",True,7
3,"(2021-12-13 14:30:22, 2021-12-13 14:30:24)",False,17
4,"(2021-12-13 14:29:28, 2021-12-13 14:29:30)",False,11
5,"(2021-12-13 14:28:48, 2021-12-13 14:28:50)",True,78
6,"(2021-12-13 14:29:18, 2021-12-13 14:29:20)",True,9
7,"(2021-12-13 14:28:54, 2021-12-13 14:28:56)",False,45
8,"(2021-12-13 14:28:46, 2021-12-13 14:28:48)",True,16
9,"(2021-12-13 14:28:50, 2021-12-13 14:28:52)",False,29
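
Counting is only half of the detection: something still has to decide when a count is too high. The sketch below applies a fixed threshold to the windowed counts shown above (window start times copied from the table, date omitted). The function `suspicious_windows` and the threshold of 50 flagged events per 2-second window are assumptions chosen for illustration, not values from the exercise.

```python
from typing import List, Tuple

# (window start, flagged, count), copied from the result table above.
WindowCount = Tuple[str, bool, int]

rows: List[WindowCount] = [
    ("14:29:12", False, 5),
    ("14:29:30", True, 7),
    ("14:29:32", True, 7),
    ("14:30:22", False, 17),
    ("14:29:28", False, 11),
    ("14:28:48", True, 78),
    ("14:29:18", True, 9),
    ("14:28:54", False, 45),
    ("14:28:46", True, 16),
    ("14:28:50", False, 29),
]


def suspicious_windows(rows: List[WindowCount], min_flagged: int = 50) -> List[str]:
    """Return the start times of windows with at least min_flagged flagged events."""
    return sorted(
        start for start, flagged, count in rows
        if flagged and count >= min_flagged
    )


print(suspicious_windows(rows))  # ['14:28:48']
```

A realistic threshold would depend on normal traffic levels and on the window size.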


Now run these queries and write the output to the Kafka topics `clicks-calculated-forbidden` and `clicks-calculated-ddos`. Use a trigger with `processingTime = "30 seconds"` for the DDoS query, so that a new micro-batch is started at most once every 30 seconds.

In [11]:
query_forbidden = (
    df_forbidden
    .selectExpr("to_json(struct(*)) as value")
    .writeStream.format("kafka")
    .outputMode('update')
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "clicks-calculated-forbidden")
    .option("checkpointLocation", "checkpoints-forbidden")
    .queryName("query_forbidden")
    .start()
)

# Sleep two seconds
sleep(2)

# Show the status of the query
display(query_forbidden.status)

21/12/13 14:41:06 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
21/12/13 14:41:06 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:06 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:06 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:06 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:06 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when Ka

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

In [12]:
query_ddos = (
    df_ddos_window_watermark
    .selectExpr("to_json(struct(*)) as value")
    .writeStream.format("kafka")
    .trigger(processingTime='30 seconds')
    .outputMode('update')
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "clicks-calculated-ddos")
    .option("checkpointLocation", "checkpoints-ddos")
    .queryName("query_ddos")
    .start()
)

# Sleep two seconds
sleep(2)

# Show the status of the query
display(query_ddos.status)

21/12/13 14:41:13 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
21/12/13 14:41:13 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:13 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:13 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:13 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
21/12/13 14:41:13 WARN KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when Ka

{'message': 'Processing new data',
 'isDataAvailable': True,
 'isTriggerActive': True}

21/12/13 14:41:15 WARN NetworkClient: [Producer clientId=producer-1] Error while fetching metadata with correlation id 133 : {clicks-calculated-ddos=LEADER_NOT_AVAILABLE}
                                                                                

## Spark helpers

The following code stops all running queries.

In [13]:
sleep(2)

for q in spark.streams.active:
    print("Stopping query '{}' with name '{}'".format(q.id, q.name))
    q.stop()


Stopping query 'ea434011-b31b-4957-a6d9-25ec5ef90777' with name 'query_forbidden'
Stopping query 'c17f064b-33ff-4db2-bd4f-abcdf79b124d' with name 'query_ddos'
