# Ingesting pageview data

In this exercise, you create an HTTP endpoint that ingests page view data (clicks) into the platform. Every time a user clicks a link or scrolls to certain positions on the page, a JSON "click" object is sent to this endpoint.

This is what a click event looks like. The `#` comments are explanatory only; they are not part of the JSON that is sent.

```json
{
    "visitor_platform": "mobile",
    # Timestamp of the event (milliseconds since unix epoch)
    "ts_ingest": 1515819844345,
    
    "article_title": "Cercanías San Sebastián",
    "visitor_country": "BE",
    
    # Seconds the page was open before this event was sent.
    # (0 when this event is sent immediately after the page was opened.)
    "visitor_page_timer": 0,

    "visitor_os": "ios",
    "article": "https://en.wikipedia.org/wiki/Cercan%C3%ADas_San_Sebasti%C3%A1n",
    
    # How much the user scrolled before this event was sent.
    # (0 when this event is sent while the user hasn't scrolled yet.)
    "visitor_page_height": 0,
    
    "visitor_browser": "unknown"
}
```
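
To work with such an event in Python, it can help to parse it and convert the timestamp into a readable form. The following is a minimal sketch, assuming the event arrives as plain JSON without the explanatory comments; the variable name `raw_event` and the helper `event_time` are illustrative and not part of the exercise code.

```python
import json
from datetime import datetime, timedelta, timezone

# A click event as it would arrive on the wire (plain JSON, no comments).
raw_event = """{
    "visitor_platform": "mobile",
    "ts_ingest": 1515819844345,
    "article_title": "Cercanías San Sebastián",
    "visitor_country": "BE",
    "visitor_page_timer": 0,
    "visitor_os": "ios",
    "visitor_page_height": 0,
    "visitor_browser": "unknown"
}"""

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def event_time(event: dict) -> datetime:
    """Convert `ts_ingest` (milliseconds since the Unix epoch) to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=event["ts_ingest"])


event = json.loads(raw_event)
print(event["article_title"], event_time(event).isoformat())
```

Using `timedelta(milliseconds=...)` keeps the conversion exact, whereas dividing by 1000 first would go through floating point.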


We use the [Python Flask framework](http://flask.pocoo.org/) to create the ingest HTTP endpoint. Flask is a lightweight and simple, yet very powerful framework for writing HTTP web servers in Python. Flask powers the APIs of many large web services such as [Netflix](https://medium.com/netflix-techblog/python-at-netflix-bba45dae649e), [Airbnb](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8), [Uber](https://github.com/uber/clay) and [Reddit](https://stackshare.io/reddit/reddit).

## Apache Kafka as event store

The clicks that our API receives are stored in [Apache Kafka](https://kafka.apache.org/), a distributed streaming platform initially created by LinkedIn. Kafka stores large distributed queues, called *topics*, and allows *producers* to send data to the queue and *consumers* to read data from the queue, all in a fault-tolerant and durable way.

The sole responsibility of the Ingest API is to receive click events from HTTP POST requests and put them on the `clicks` topic in Kafka. The Ingest API itself doesn't do any cleaning or filtering; this happens later in the pipeline. Using Kafka here has a number of advantages.

* Kafka acts as a **buffer** between the ingest of events and the processing of events. Downstream issues, such as the processing code crashing, don't affect the ingest of events. This also allows the platform to **gracefully handle spikes in load**. When the processing code can't handle the load, Kafka will gather a backlog of unprocessed events, but the events will not be lost and the processing code can catch up when the load decreases again.
* Kafka allows multiple consumers to subscribe to the same topic. This makes it easy to have **multiple parallel processing pipelines** which receive the same event stream. You can use this to run multiple versions of your code next to each other for testing or quality assurance, or to have a staging environment that receives the same event stream as the production environment.
* Kafka is a **language-agnostic** platform with many client libraries, so you can use it to create heterogeneous streaming analytics pipelines.

For long-term storage of the event log, you can couple Kafka with Hadoop and create a small service that persists the event log in HDFS. You can then replay the historic log using a producer that reads from HDFS, or you can process the event log in batch using a technology like Hadoop MapReduce or Spark.
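
The buffering, multi-consumer and replay behaviour described above can be illustrated with a toy model. The sketch below is not Kafka and does not use any Kafka API; `ToyTopic` is a hypothetical in-memory stand-in for a single topic partition, meant only to show that events are appended to a log and that each consumer keeps its own read position.

```python
from typing import Dict, List


class ToyTopic:
    """In-memory stand-in for one topic partition (illustration only)."""

    def __init__(self) -> None:
        self._log: List[bytes] = []         # append-only event log
        self._offsets: Dict[str, int] = {}  # next offset to read, per consumer

    def send(self, value: bytes) -> int:
        """Append a message and return its offset."""
        self._log.append(value)
        return len(self._log) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[bytes]:
        """Return unread messages for `consumer` and advance only its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Move a consumer's read position, e.g. to 0 to replay the log."""
        self._offsets[consumer] = offset

    def lag(self, consumer: str) -> int:
        """Number of messages this consumer has not read yet (its backlog)."""
        return len(self._log) - self._offsets.get(consumer, 0)


clicks = ToyTopic()
for n in range(5):
    clicks.send(f"click-{n}".encode("utf-8"))

production = clicks.poll("production")             # reads everything
staging = clicks.poll("staging", max_records=2)    # slower consumer falls behind
print(clicks.lag("production"), clicks.lag("staging"))

clicks.seek("production", 0)                       # replay from the start
replayed = clicks.poll("production")
```

The slow `staging` consumer builds up a backlog without affecting `production`, and nothing is lost: the log still holds every event, so either consumer can catch up or replay.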

So, to start the exercise, let's install the required dependencies.

In [7]:
%%bash
# Install the required Python 3 dependencies
python3 -m pip install kafka-python flask

Collecting flask
  Downloading Flask-2.0.2-py3-none-any.whl (95 kB)
Collecting itsdangerous>=2.0
  Downloading itsdangerous-2.0.1-py3-none-any.whl (18 kB)
Collecting Werkzeug>=2.0
  Downloading Werkzeug-2.0.2-py3-none-any.whl (288 kB)
Installing collected packages: Werkzeug, itsdangerous, flask
Successfully installed Werkzeug-2.0.2 flask-2.0.2 itsdangerous-2.0.1


# Flask introduction

The following code snippet is everything you need to get a working "Hello World" HTTP API. You start the server by running the cell (select the cell with the Python code and press `ctrl`-`enter`).


> Note: in a Jupyter notebook, only one cell can run at a time. This webserver cell keeps running until it is killed. Before you proceed to the next cell, **you need to manually stop this webserver** by clicking on the "stop" button on the toolbar of this notebook.

In [8]:
from flask import Flask

# Create a new webserver.
app = Flask(__name__)

# For each GET request to http://localhost:5000/, send the string "Index Page" as response.
@app.route('/')
def index():
    return 'Index Page\n'

# For each GET request to http://localhost:5000/hello, send the string "Hello, World" as response.
@app.route('/hello')
def hello():
    return 'Hello, World\n'

# Run the webserver and allow requests from any IP.
if __name__ == "__main__":
    app.run(host='0.0.0.0')

 * Serving Flask app '__main__' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on all addresses.
 * Running on http://172.24.0.3:5000/ (Press CTRL+C to quit)


You can test the server by visiting [http://localhost:5000](http://localhost:5000), or by using the `curl` command in a terminal.

```txt
$ curl http://localhost:5000
Index Page
$ curl http://localhost:5000/hello
Hello, World
```

## Basic functionality

* By adding the `@app.route("/")` decorator to a Python function, you define the URL that will execute it. In the above example, each call to `/` will run the `index` function. Such a decorated function is called a "view function". Take a look at [Flask quickstart - Routing](https://flask.palletsprojects.com/en/1.1.x/quickstart/#routing) for more information.
* The return value of a view function gets processed by [`make_response()`](https://flask.palletsprojects.com/en/1.1.x/api/#flask.Flask.make_response) to turn it into an HTTP response. This function supports many different kinds of arguments. The simplest is a string, which will return the string as response body and use status code 200. You can also use a tuple to specify other response codes.
* Flask provides a global `flask.request` object that represents the request currently being handled. You can use this to get the HTTP request headers, body and more. For example, `request.json` holds the parsed JSON body of the request (a dictionary when the body is a JSON object). See [Incoming Request Data](https://flask.palletsprojects.com/en/1.1.x/api/#incoming-request-data) for more information.
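
The three points above can be combined in a small sketch. It assumes a hypothetical `/echo` route (not part of the exercise) and uses Flask's built-in test client, so no server needs to be started and no notebook cell stays blocked.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


# Hypothetical route for illustration: echoes the JSON body it receives.
@app.route("/echo", methods=["POST"])
def echo():
    # silent=True returns None instead of raising when the body is not valid JSON.
    payload = request.get_json(silent=True)
    if payload is None:
        # A (body, status) tuple is turned into a 400 response by make_response().
        return jsonify({"error": "expected a JSON body"}), 400
    return jsonify({"received": payload}), 200


# The test client sends requests to the app without opening a network port.
client = app.test_client()
ok = client.post("/echo", json={"visitor_country": "BE"})
bad = client.post("/echo", data="not json", content_type="application/json")
print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

Returning `jsonify(...)` rather than a hand-built string lets Flask set the `Content-Type: application/json` header itself.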

# Kafka introduction

This VSCode workspace sets up two containers. The container you're working in now runs a Spark cluster and Jupyter. The second container runs Apache Kafka, the message queue used in our Kappa architecture. You can access the Kafka cluster from this container by connecting to `localhost:9092`.

Below is an example of how to publish data to Kafka using Python. We use the [`kafka-python`](https://kafka-python.readthedocs.io/en/master/usage.html) package for this. Since Kafka is a distributed _queue_, we write data to Kafka by sending messages. The code writing data is called a producer. The code reading the data is called a consumer.

> Note: if you encounter a `NoBrokersAvailable` error message, that means that the Kafka server is not reachable. Something must have gone wrong with running the containers. We suggest pressing the green "Remote" button on the bottom left of VSCode and choosing "rebuild container". If this doesn't work, you should manually remove all containers and restart VSCode.

In [9]:
from kafka import KafkaProducer

topic = "test-topic"

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'])

# Note: The data you send must be binary
producer.send(topic, b"Hello World!").get(timeout=30)
producer.send(topic, b"Foo").get(timeout=30)
producer.send(topic, b"Bar").get(timeout=30)

RecordMetadata(topic='test-topic', partition=0, topic_partition=TopicPartition(topic='test-topic', partition=0), offset=8, timestamp=1639361786376, log_start_offset=0, checksum=None, serialized_key_size=-1, serialized_value_size=3, serialized_header_size=-1)

In [10]:
from kafka import KafkaConsumer, TopicPartition


# Read partition 0 of the topic, starting from the earliest available message
consumer = KafkaConsumer(
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',
    # Stop iteration if no message after 0.5sec
    consumer_timeout_ms=500)
tp = TopicPartition(topic, 0)
consumer.assign([tp])

# Go to the beginning of the queue
consumer.seek(tp, 0)

for message in consumer:
    # message value and key are raw bytes -- decode if necessary!
    # e.g., for unicode: `message.value.decode('utf-8')`
    print(f"{message.topic}:{message.partition}:{message.offset}: key={message.key} value={message.value}")

test-topic:0:0: key=None value=b'Hello World!'
test-topic:0:1: key=None value=b'Foo'
test-topic:0:2: key=None value=b'Bar'
test-topic:0:3: key=None value=b'Hello World!'
test-topic:0:4: key=None value=b'Foo'
test-topic:0:5: key=None value=b'Bar'
test-topic:0:6: key=None value=b'Hello World!'
test-topic:0:7: key=None value=b'Foo'
test-topic:0:8: key=None value=b'Bar'
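
Because Kafka messages are raw bytes, structured events have to be serialized before sending and deserialized after reading. The helpers below are a minimal sketch of that round trip using JSON and UTF-8; the names `serialize` and `deserialize` are illustrative. kafka-python's `KafkaProducer` and `KafkaConsumer` accept `value_serializer` and `value_deserializer` callables, so functions like these could be passed there, but this sketch only exercises the helpers themselves and does not connect to Kafka.

```python
import json


def serialize(event: dict) -> bytes:
    """Turn a Python dictionary into UTF-8 encoded JSON bytes."""
    return json.dumps(event, ensure_ascii=False).encode("utf-8")


def deserialize(raw: bytes) -> dict:
    """Turn UTF-8 encoded JSON bytes back into a Python dictionary."""
    return json.loads(raw.decode("utf-8"))


event = {"article_title": "Cercanías San Sebastián", "visitor_country": "BE"}
raw = serialize(event)
print(type(raw).__name__, deserialize(raw) == event)
```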


## Clearing Jupyter Notebook Output

Since the server writes a log line to output for every request, it's possible to **write so much output that the notebook hangs**. Therefore, it's best to clear the output regularly.

You can clear the output using Python. It's advised to clear the output at least once every 500 requests.

In [11]:
from IPython.display import clear_output

print("This line will be cleared.")
print("This line will be cleared too.")
clear_output()
print("This line will be visible.")
print("This is visible too!")

This line will be visible.
This is visible too!
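
One way to clear the output at a fixed interval is to count requests and call a clearing function every N-th time. The sketch below assumes a hypothetical helper class `OutputLimiter`; it takes the clearing function as an argument so that it also runs outside a notebook. In the notebook you would pass `clear_output` from `IPython.display`.

```python
from typing import Callable


class OutputLimiter:
    """Calls `clear` once every `every` ticks (illustrative helper)."""

    def __init__(self, clear: Callable[[], None], every: int = 500) -> None:
        self._clear = clear
        self._every = every
        self._count = 0

    def tick(self) -> None:
        """Call once per handled request."""
        self._count += 1
        if self._count >= self._every:
            self._clear()
            self._count = 0


# Demo with a stand-in callback that records each clear instead of clearing.
cleared = []
limiter = OutputLimiter(clear=lambda: cleared.append(True), every=3)
for _ in range(7):
    limiter.tick()
print(len(cleared))
```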


## Endpoint API

> **Task:** Write a server that listens for `POST` requests to the url `/clicks`, reads the body of the request as `json`, and sends that body to the `clicks` topic on Kafka.

Tips:

* Use KafkaProducer for writing to Kafka.
* Remember data sent to Kafka needs to be in binary format.
  * Turn Python dictionaries into a json string using `string = json.dumps(python_object)`
  * Turn the json string into binary format using `string.encode('utf-8')`
* Use Flask for the webserver.
  * Get the request body with `request.json`.
  * In order to respond to POST requests, add `methods=['POST']` to the `app.route()` of a function.
  * The return value of a view function gets processed by [`make_response`](https://flask.palletsprojects.com/en/1.1.x/api/#flask.Flask.make_response). Return a tuple with a response body and the status code.
    * If the request is malformed, respond with the HTTP status code `400` and include in the body of the response a message explaining the issue.
    * If writing the click event to Kafka succeeds, respond with the HTTP status code `200`.
* Clear the Jupyter notebook output every X requests using `clear_output()`.


Use `1b-fake-website.ipynb` to simulate click data. You can use the scripts in `debug.ipynb` to read from Kafka topics to see if the messages are pushed correctly.

The [Postman app](https://www.postman.com/downloads/) is a useful GUI tool to create HTTP calls for testing. You can also use `curl` to test the API from the command line:

```bash
curl --header "Content-Type: application/json" \
     --request POST \
     --data '{"visitor_platform": "mobile","ts_ingest": 1515819844345,"article_title": "Cercanías San Sebastián","visitor_country": "BE","visitor_page_timer": 0,"visitor_os": "ios","article": "https://en.wikipedia.org/wiki/Cercan%C3%ADas_San_Sebasti%C3%A1n","visitor_page_height": 0,"visitor_browser": "unknown"}' \
     http://localhost:5000/clicks
```
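
The same request can also be built from Python with the standard library. This is a sketch under the assumption that the server listens on `http://localhost:5000/clicks` as in the `curl` call; the helper name `build_click_request` is illustrative. The request is only constructed here; the lines that actually send it are left commented out because they require the server to be running.

```python
import json
import urllib.request


def build_click_request(
    event: dict, url: str = "http://localhost:5000/clicks"
) -> urllib.request.Request:
    """Build a POST request carrying `event` as a JSON body."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_click_request({"visitor_platform": "mobile", "ts_ingest": 1515819844345})
print(req.get_method(), req.full_url)

# To send it (requires the ingest server to be running):
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```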

In [12]:
import json
from flask import Flask, request
from kafka import KafkaProducer
from IPython.display import clear_output

# Create a Kafka producer. The server is available at localhost:9092
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

# Counter for the number of requests handled since the output was last cleared
i = 0

# Create a new webserver
app = Flask(__name__, static_url_path='')

# Handle POST requests to http://localhost:5000/clicks
@app.route('/clicks', methods=['POST'])
def ingest_click():
    # Clear the output after every 500 requests
    global i
    i = i + 1
    if i >= 500:
        clear_output()
        i = 0

    # Get the json body of the POST request as a Python dictionary.
    # silent=True yields None instead of raising when the body is not valid json.
    rjson = request.get_json(silent=True)

    # If parsing the json failed (or the body is empty), return a 400 HTTP error code
    if not rjson:
        return (
            json.dumps({'success': False, 'message': 'could not decode json'}),
            400,
            {'Content-Type': 'application/json'}
        )

    # If parsing succeeded, wait up to 5 seconds for Kafka to confirm the event
    producer.send('clicks', json.dumps(rjson).encode('utf-8')).get(timeout=5)

    # If sending succeeded, respond to the client that all is well
    return (
        json.dumps({'success': True}),
        200,
        {'Content-Type': 'application/json'}
    )

# Run the webserver
app.run(host='0.0.0.0')

 * Serving Flask app '__main__' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on all addresses.
 * Running on http://172.24.0.3:5000/ (Press CTRL+C to quit)
