# DDoS and intrusion detection


In this exercise, we'll use Spark Structured Streaming to detect DDoS attacks and attempts to access the admin panel of the website.

* Use the [fake-ddos](fake-ddos.ipynb) notebook to simulate a DDoS attack.
* Use the [fake-intrusion](fake-intrusion.ipynb) notebook to simulate an intrusion attempt.

In [None]:
%%bash
# Ensure the required Python 3 dependencies are installed.
python3 -m pip install kafka-python

Create a Spark context. The `PYSPARK_SUBMIT_ARGS` environment variable must be set before the context is created so that Spark loads the Kafka connector package (`spark-sql-kafka-0-10`).

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 pyspark-shell'

import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *


sc = SparkContext()
sc.setLogLevel("WARN")
spark = SparkSession(sc)

Create a streaming DataFrame that represents the events received from the Kafka topic `clicks-cleaned`.

In [None]:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers","localhost:9092") \
    .option("subscribe", "clicks-cleaned") \
    .option("startingOffsets", "latest") \
    .option("failOnDataLoss", "false") \
    .load()

Parse the JSON payload into DataFrame columns. Use `TimestampType` for `ts_ingest`, since it was already converted to a timestamp in the `clean` notebook.

In [None]:
schema = StructType([
    StructField("visitor_platform", StringType()),
    StructField("ts_ingest", TimestampType()),
    StructField("article_title", StringType()),
    StructField("visitor_country", StringType()),
    StructField("visitor_os", StringType()),
    StructField("article", StringType()),
    StructField("visitor_browser", StringType()),
    StructField("visitor_page_timer", IntegerType()),
    StructField("visitor_page_height", IntegerType()),
])

dfs = df.selectExpr("CAST(value AS STRING)") \
      .select(from_json(col("value"), schema) \
      .alias("clicks"))

df_data = dfs.select("clicks.*")
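
To make the parsing step concrete, here is a minimal plain-Python sketch of what `from_json` does to a single message: fields named in the schema are kept, fields missing from the JSON become null, and fields that are not in the schema are dropped. The sample message, the ISO-8601 form of `ts_ingest`, and the helper `parse_click` are assumptions made for illustration; they are not part of the exercise.

```python
import json
from datetime import datetime

# Field names copied from the schema above. The converters are a simplification
# of Spark's types; the ISO-8601 timestamp format is an assumption.
FIELDS = {
    "visitor_platform": str,
    "ts_ingest": datetime.fromisoformat,
    "article_title": str,
    "visitor_country": str,
    "visitor_os": str,
    "article": str,
    "visitor_browser": str,
    "visitor_page_timer": int,
    "visitor_page_height": int,
}


def parse_click(raw_value):
    """Project one JSON message onto the schema (hypothetical helper)."""
    record = json.loads(raw_value)
    row = {}
    for name, convert in FIELDS.items():
        value = record.get(name)
        # Fields absent from the message become None, like null in Spark.
        row[name] = convert(value) if value is not None else None
    return row


# Hypothetical message: "session_id" is not in the schema and is dropped.
sample = json.dumps({
    "visitor_platform": "desktop",
    "ts_ingest": "2019-06-01T12:00:05",
    "article": "/admin",
    "visitor_page_timer": 0,
    "visitor_page_height": 0,
    "session_id": "abc",
})
row = parse_click(sample)
print(row)
```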

Create a [user-defined function (`udf`)](https://docs.databricks.com/spark/latest/spark-sql/udf-python.html) `forbidden_clicks` that takes a URL as input and returns `True` if the URL points to the admin part of the website, i.e. if it ends with `/admin`.

As an example, the following code creates a UDF that squares each value in a column. It is applied to the `id` column, and the resulting column is named `id_squared`. (`display` is a Databricks notebook helper for rendering DataFrames.)

```python
from pyspark.sql.functions import udf
@udf("long")
def squared_udf(s):
  return s * s
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))
```

In [None]:
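# Declare the return type explicitly; a bare @udf returns strings.
@udf("boolean")
def forbidden_clicks(click_url):
    # Null-safe: a missing URL is never an admin URL.
    return click_url is not None and click_url.endswith('/admin')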

Use the UDF to create the DataFrame `df_forbidden`, which contains the column `forbidden` indicating whether the URL is an admin URL.

In [None]:
# For every article URL, check whether it is a forbidden URL.
# We cannot use map() here: PySpark DataFrames have not supported it since version 2.0.
# Calling map() would require converting the DataFrame to an RDD, which is not allowed in Structured Streaming.
# Only DataFrame and SQL operations are supported; conversions to RDDs, DStreams or local collections are not.
# Because of this, we use a user-defined function (UDF) to run Python code on a column.
df_forbidden = df_data.select('article', forbidden_clicks('article').cast('boolean').alias('forbidden'))
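
The predicate inside the UDF is ordinary Python, so it can be checked without Spark. The sketch below uses a hypothetical helper `is_admin_url` and made-up URLs; it also guards against `None`, which is what a UDF receives when the column value is null.

```python
def is_admin_url(click_url):
    """Return True when the URL ends with '/admin' (hypothetical helper)."""
    # A null column value arrives as None; treat it as "not an admin URL".
    return click_url is not None and click_url.endswith("/admin")


# Made-up URLs for illustration only.
urls = ["/admin", "/blog/admin", "/admin/", "/articles/spark", None]
flags = [is_admin_url(url) for url in urls]
print(list(zip(urls, flags)))
```

Note that a trailing slash (`/admin/`) does not match, since the check is a plain suffix comparison.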

We'll do the same for detecting DDoS attacks. First, we flag whether an individual event is suspicious, i.e. whether `visitor_page_timer` and `visitor_page_height` are both `0`. This time, however, we'll use a `pandas_udf`.

[Regular Python UDFs have the disadvantage that they operate on one row at a time](https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), which causes high serialization and invocation overhead. Pandas UDFs are built on top of Apache Arrow to provide high-performance UDFs in Python.

This is `squared_udf` converted to a pandas UDF:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def squared_pandas_udf(s):
    return s * s

df = spark.table("test")
display(df.select("id", squared_pandas_udf("id").alias("id_squared")))
```

The regular UDF version works one row at a time: the user-defined function takes a long `s` and returns the result of `s*s` as a long. In the Pandas version, the user-defined function takes a pandas.Series `s` and returns the result of `s*s` as a pandas.Series. Because `s*s` is vectorized on `pandas.Series`, the Pandas version is much faster than the row-at-a-time version.

Note that there are two important requirements when using scalar pandas UDFs:

* The input and output series must have the same size.
* How a column is split into multiple `pandas.Series` is internal to Spark, so the result of the user-defined function must be independent of the splitting.
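
The second requirement can be illustrated without Spark. The following sketch uses plain Python lists in place of `pandas.Series` and made-up values; the helpers `flag_batch` and `centered_batch` are hypothetical. An element-wise function gives the same result however the column is split, whereas a function that depends on a per-batch statistic does not and would therefore be unsuitable as a scalar pandas UDF.

```python
def flag_batch(page_timer, page_height):
    # Element-wise: each output depends only on the matching input row.
    return [t == 0 and h == 0 for t, h in zip(page_timer, page_height)]


def centered_batch(values):
    # Depends on the batch mean, so the result changes with the split.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Made-up column values for illustration.
timers = [0, 12, 0, 7]
heights = [0, 800, 600, 0]

# One batch versus two batches of two rows each.
whole = flag_batch(timers, heights)
split = flag_batch(timers[:2], heights[:2]) + flag_batch(timers[2:], heights[2:])

centered_whole = centered_batch(timers)
centered_split = centered_batch(timers[:2]) + centered_batch(timers[2:])

print(whole == split)                    # element-wise: identical results
print(centered_whole == centered_split)  # batch-dependent: results differ
```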


In [None]:
# Flag events where both 'visitor_page_timer' and 'visitor_page_height' are 0.
@pandas_udf('boolean', PandasUDFType.SCALAR)
def ddos_flagged(page_timer, page_height):
    return (page_timer == 0) & (page_height == 0)

# Use ddos_flagged to create df_ddos, in which all suspicious events are flagged.
df_ddos = df_data.select("*", ddos_flagged('visitor_page_timer', 'visitor_page_height').alias('flagged'))

The second step in detecting a DDoS attack is counting how many suspicious events happen within a certain time frame. For this, we'll combine `groupBy` with a 30-second `window` based on the `ts_ingest` timestamp.

In [None]:
df_ddos_window = df_ddos.groupBy(
    window(df_ddos.ts_ingest, '30 seconds'),
    df_ddos.flagged
).count()
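
As a rough picture of what this grouping computes, the plain-Python sketch below assigns made-up events to 30-second tumbling windows and counts them per window and flag. It assumes the default alignment of windows to the Unix epoch and treats each window as including its start and excluding its end; the helper `window_start` and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(seconds=30)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts):
    """Floor a timestamp to the start of its 30-second window."""
    return ts - ((ts - EPOCH) % WINDOW)


def at(second):
    # Made-up event times within a single minute, for illustration.
    return datetime(2019, 6, 1, 12, 0, second, tzinfo=timezone.utc)


# (ts_ingest, flagged) pairs.
events = [(at(5), True), (at(29), True), (at(30), True), (at(31), False)]

# Equivalent in spirit to groupBy(window, flagged).count().
counts = Counter((window_start(ts), flagged) for ts, flagged in events)
for (start, flagged), count in sorted(counts.items()):
    print(start.time(), (start + WINDOW).time(), flagged, count)
```

The event at second 30 lands in the second window, because a window covers its start time but not its end time.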

Now run these queries and write the output to the Kafka topics `clicks-calculated-forbidden` and `clicks-calculated-ddos`. Use a trigger with `processingTime = "30 seconds"` for the DDoS query so that its micro-batches are started at 30-second intervals.

In [None]:
# Debug dataframes in terminal
# query_data = df_data.writeStream.outputMode("append").option("truncate", "false").format("console").start()
# query_forbidden = df_forbidden.writeStream.outputMode("append").option("truncate", "false").format("console").start()
# query_ddos = df_ddos_window.writeStream.outputMode("update").option("truncate", "true").format("console").start()

query_forbidden = df_forbidden.selectExpr("to_json(struct(*)) as value") \
    .writeStream.format("kafka") \
    .outputMode('update') \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "clicks-calculated-forbidden") \
    .option("checkpointLocation", "checkpointsforbidden") \
    .start()

query_ddos = df_ddos_window.selectExpr("to_json(struct(*)) as value") \
    .writeStream.format("kafka") \
    .trigger(processingTime='30 seconds') \
    .outputMode('update') \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "clicks-calculated-ddos") \
    .option("checkpointLocation", "checkpointsddos") \
    .start()

In [None]:
spark.streams.awaitAnyTermination()
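
Once the queries are running, the results can be inspected from a separate process, since `awaitAnyTermination()` blocks the notebook. The sketch below is one possible way to do this with `kafka-python`, which was installed at the top of this notebook. The threshold of 100 flagged events per window, the helper names, and the sample message are assumptions for illustration; the message shape follows from `to_json(struct(*))` applied to the `window`, `flagged` and `count` columns.

```python
import json

ALERT_THRESHOLD = 100  # hypothetical value; tune it to your normal traffic


def is_ddos_alert(message_value, threshold=ALERT_THRESHOLD):
    """Return True if a windowed count message looks like a DDoS burst."""
    record = json.loads(message_value)
    return bool(record.get("flagged")) and record.get("count", 0) >= threshold


def watch_ddos_topic(bootstrap_servers="localhost:9092"):
    """Print suspicious windows as they arrive (needs a running Kafka broker)."""
    from kafka import KafkaConsumer  # imported lazily; only needed here

    consumer = KafkaConsumer("clicks-calculated-ddos",
                             bootstrap_servers=bootstrap_servers)
    for message in consumer:
        if is_ddos_alert(message.value):
            print("Possible DDoS:", message.value.decode("utf-8"))


# Illustrative message in the shape written by the DDoS query.
sample = json.dumps({
    "window": {"start": "2019-06-01T12:00:00.000Z",
               "end": "2019-06-01T12:00:30.000Z"},
    "flagged": True,
    "count": 250,
}).encode("utf-8")
alert = is_ddos_alert(sample)
print(alert)
```

Because the query uses the `update` output mode, the same window can appear in several messages as its count grows, so a consumer like this may report a window more than once.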