# Kappa Architecture hands-on lab

During the previous courses, you've been introduced to a number of tools and techniques to store and process data. This hands-on lab explains how you can combine these tools and techniques into an end-to-end data processing pipeline.

## Use case: a Flemish website

Throughout this lab, you will build a platform to capture and analyze traffic to a website. You will build the following components:

* An HTTP endpoint to gather pageview data from a website
* A data cleaning pipeline for the pageview data
* A security alerting system that detects DDoS attacks and requests trying to access your management platform
* A real-time dashboard showing the number of users on your website at any time

You'll also learn how to load the pageview data into a time-series database so it's accessible for Business Intelligence tools.

The dataset used in this lab is a three-hour window of page view data from a popular Flemish news website. The dataset is anonymized, and the URLs have been changed to point to Wikipedia pages to protect the privacy of the users and the website owner. Although this three-hour window is only 500 MB in size, the original dataset is ~1 TB compressed. We used a setup similar to the one explained in this lab to analyze and predict the popularity of news articles.

## Introduction to Kappa Architecture

[The Kappa Architecture](http://milinda.pathirage.org/kappa-architecture.com/) is a Big Data processing design pattern. It is a high-level description of how to combine data analytics tools to solve real problems. It isn't a one-size-fits-all solution: many other data processing design patterns exist, such as [the Lambda architecture](http://lambda-architecture.net/) and [the Zeta architecture](https://mapr.com/solutions/zeta-enterprise-architecture/).




The Kappa architecture centers on "events" and uses an event log as the source of truth. Instead of storing only the final state of your data, you store each individual event. For the platform in this lab, we'll store each individual page load and use that event log as our source of truth. We then calculate the visitors per minute from that event log and store the result in a *serving database*. Apps and tools that need to know the number of visitors connect to the serving database to get their information, just like they would in a traditional data processing architecture. The actual raw page view data, however, is still available for reprocessing. The Kappa architecture is ideal for time-series data such as page views, logs or IoT sensor readings.
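To make the idea concrete, here is a minimal sketch in plain Python rather than the Kafka and Spark setup used in the lab. The event fields (`visitor`, `url`, `ts`) and the function name are assumptions made for illustration; the real dataset may use different names.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: one record per page load, only ever appended to.
event_log = [
    {"visitor": "a", "url": "/home", "ts": "2018-01-01T10:00:05"},
    {"visitor": "b", "url": "/home", "ts": "2018-01-01T10:00:40"},
    {"visitor": "a", "url": "/news", "ts": "2018-01-01T10:01:10"},
]

def visitors_per_minute(events):
    """Derive a serving view from the log: distinct visitors per minute."""
    seen = set()
    counts = Counter()
    for event in events:
        minute = datetime.fromisoformat(event["ts"]).strftime("%H:%M")
        key = (minute, event["visitor"])
        if key not in seen:  # count each visitor once per minute
            seen.add(key)
            counts[minute] += 1
    return dict(counts)

# The "serving database" is derived data; the raw log stays untouched.
serving_db = visitors_per_minute(event_log)
print(serving_db)  # {'10:00': 2, '10:01': 1}
```

Because `event_log` is never modified, the serving view can be thrown away and recomputed at any time.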


Below you see the Kappa architecture applied to our website pageviews example.

1. The website contains JavaScript code that sends an HTTP POST request to the ingest API when the user first loads the website and when the user scrolls down.
1. The ingest API publishes each individual event on the "clicks" topic on Kafka, a distributed queue.
1. Spark Streaming jobs subscribe to the "clicks" topic, process the clicks and write the processed data to serving databases and new Kafka topics.
1. Applications such as the Tableau BI platform or Grafana connect to a serving database and use the data.

<img src="img/Big Data Hands On - Kappa.png">
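The four steps above can be sketched with in-memory stand-ins, which may help before you build the real components. This is only a toy model: the `Topic` class, the `clean_clicks` topic name and the `url` field are assumptions for illustration, not the Kafka or Spark APIs.

```python
import json

class Topic:
    """In-memory stand-in for a Kafka topic: an append-only list of messages."""

    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)

    def read_from(self, offset):
        """Return all messages at or after the given offset."""
        return self.messages[offset:]

clicks = Topic()        # raw events published by the ingest API
clean_clicks = Topic()  # hypothetical topic for processed events
serving_db = {}         # stand-in for a serving database: url -> page views

def ingest(body):
    """Step 2: the ingest API publishes each raw event as-is."""
    clicks.publish(body)

def process(offset):
    """Step 3: read new events, drop malformed ones, write the results."""
    batch = clicks.read_from(offset)
    for raw in batch:
        try:
            event = json.loads(raw)
            url = event["url"]
        except (ValueError, KeyError, TypeError):
            continue  # skip events that are not valid JSON objects with a url
        clean_clicks.publish(event)
        serving_db[url] = serving_db.get(url, 0) + 1
    return offset + len(batch)  # where the next run should start reading

# Step 1: the website sends events; one of them is malformed.
ingest('{"url": "/home"}')
ingest('not json')
ingest('{"url": "/home"}')

next_offset = process(0)
print(serving_db)   # step 4: applications read from here -> {'/home': 2}
print(next_offset)  # 3
```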

This architecture offers a lot of flexibility:

* If you update the stream processing code that generates the databases, you can easily replay all the events in order to create a new version of the serving layer. This can happen in parallel to the existing serving layer.
* It's easy to have multiple different types of databases in the serving layer, each optimized for a specific application but built from the same dataset.
* It scales horizontally with little effort, yet it can still integrate with applications that don't scale.
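The first two points can be illustrated with a small sketch, again in plain Python with made-up event fields. Two versions of the processing code are run over the same log, and both serving views exist side by side.

```python
from collections import Counter

# Hypothetical event log shared by every version of the processing code.
event_log = [
    {"url": "/home", "visitor": "a"},
    {"url": "/news", "visitor": "a"},
    {"url": "/home", "visitor": "b"},
    {"url": "/home", "visitor": "a"},
]

def views_per_url(events):
    """Version 1 of the processing code: count page views per URL."""
    return dict(Counter(event["url"] for event in events))

def visitors_per_url(events):
    """Version 2: count distinct visitors per URL instead."""
    visitors = {}
    for event in events:
        visitors.setdefault(event["url"], set()).add(event["visitor"])
    return {url: len(ids) for url, ids in visitors.items()}

serving_v1 = views_per_url(event_log)     # existing serving layer
serving_v2 = visitors_per_url(event_log)  # replay of the same log, built alongside v1

print(serving_v1)  # {'/home': 3, '/news': 1}
print(serving_v2)  # {'/home': 2, '/news': 1}
```

Applications can keep reading `serving_v1` until `serving_v2` is complete, and then switch over.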





# Exercises

Below is the architecture of the Kappa-inspired platform you'll build in this lab. Specifically, you'll create the following components, each in its own notebook. Start with the first component, and return to this notebook each time you finish a component to continue with the next one.

1. [Create an ingest API using the Python Flask framework](endpoint.ipynb).
2. [Clean the data using Spark Structured Streaming](cleanup.ipynb).
3. [Generate security alerts](security.ipynb) and [generate a notification when such an alert happens](security-notifications.ipynb).
4. [Transform the data and load it into InfluxDB](serving.ipynb).
5. [Create a real-time dashboard using Spark](dashboard-generation.ipynb).


## Important notes

When working on UGain desktops (or laptops with similar specs), you have to be careful with the available resources. Running all notebooks of the lab at the same time is **not** possible. For each part of the lab, we list which notebooks should be running. If a notebook is not listed, it is expected to be shut down.

- Part 1 (Ingest API):
	-  [endpoint.ipynb](endpoint.ipynb)
	-  [fake-website.ipynb](fake-website.ipynb)
	-  [debug.ipynb](debug.ipynb)
- Part 2 (Data cleanup):
	-  [endpoint.ipynb](endpoint.ipynb)
	-  [fake-website.ipynb](fake-website.ipynb)
	-  [debug.ipynb](debug.ipynb)
	-  [cleanup.ipynb](cleanup.ipynb)
- Part 3 (Security):
	-  [endpoint.ipynb](endpoint.ipynb)
	-  [fake-website.ipynb](fake-website.ipynb)
	-  [debug.ipynb](debug.ipynb)
	-  [cleanup.ipynb](cleanup.ipynb)
	-  [security.ipynb](security.ipynb)
	-  [security-notifications.ipynb](security-notifications.ipynb)
	-  [fake-ddos.ipynb](fake-ddos.ipynb) (**if fake-intrusion is not running**)
	-  [fake-intrusion.ipynb](fake-intrusion.ipynb) (**if fake-ddos is not running**)
- Part 4 (Serving):
	-  [endpoint.ipynb](endpoint.ipynb)
	-  [fake-website.ipynb](fake-website.ipynb)
	-  [debug.ipynb](debug.ipynb)
	-  [cleanup.ipynb](cleanup.ipynb)
	-  [serving.ipynb](serving.ipynb)

**Always save** the notebook **before** running cells. If there is a resource problem, the browser might have to be restarted and unsaved code might be lost.


<img src="img/use-case-overview.png">



# Appendix 1: Code Completion in Jupyter

Jupyter has code completion abilities, but it doesn't show suggestions by default.

**Show suggestions by pressing `tab` after a dot.**

![image.png](img/suggestions.png)

**Show function arguments by pressing `shift` - `tab` after the opening bracket of a function.**

![image.png](img/function-docs.png)

# Appendix 2: Resetting the environment

All data in Kafka and InfluxDB is stored inside their Docker containers. Spark runs inside the notebook container. You can remove this data and start from a fresh instance by stopping and removing these containers.

```bash
docker stop kafka; docker rm kafka
docker stop influxdb; docker rm influxdb
docker stop notebook; docker rm notebook
```

If there are any checkpoint directories, remove those as well. Checkpoint directories are created when you run queries that write to a Kafka sink, and they keep track of which messages to process next.

```bash
rm -rf checkpoints*
```
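The following toy model hints at why stale checkpoints are worth removing. It is not Spark's checkpoint format or behavior: the JSON file, the `next_offset` field and the function names are assumptions used only to show what "which messages to process next" means once the topic has been recreated.

```python
import json
import os
import tempfile

def load_offset(path):
    """Return the next offset to process, or 0 if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_offset"]
    return 0

def run_job(topic, path):
    """Process unread messages and record where to resume next time."""
    start = load_offset(path)
    processed = topic[start:]
    with open(path, "w") as f:
        json.dump({"next_offset": start + len(processed)}, f)
    return processed

checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

old_topic = ["e1", "e2", "e3"]
first_run = run_job(old_topic, checkpoint)  # processes e1, e2 and e3

new_topic = ["n1", "n2"]                    # container recreated: the topic starts over
stale_run = run_job(new_topic, checkpoint)  # stale offset 3: nothing is processed

os.remove(checkpoint)                       # toy equivalent of removing the checkpoint directory
fresh_run = run_job(new_topic, checkpoint)  # starts from offset 0 again

print(first_run)  # ['e1', 'e2', 'e3']
print(stale_run)  # []
print(fresh_run)  # ['n1', 'n2']
```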

After these operations you can create new containers.

```bash
# Start Kafka
docker start kafka || docker run -d -p 2181:2181 -p 9092:9092 --env ADVERTISED_HOST=0.0.0.0 --env ADVERTISED_PORT=9092 --name kafka spotify/kafka

# Start InfluxDB
docker start influxdb || docker run -d -p 8086:8086 --name influxdb influxdb

# Start the notebook (with Spark) from the course directory, which is mounted into the container
cd ~/Desktop/kappa-course
docker start -i notebook || docker run -it --net="host" -v "$PWD":/home/jovyan/work --name notebook jupyter/pyspark-notebook
```

