### Gabriel Ohaike

### Introduction:
As a data scientist at a game development company, I want to track two events from my latest mobile game: buying a sword and joining a guild. Each event carries metadata characteristic of its type.

### Tasks:
        
**To do this, I will:**

  1. Instrument my API server to log events to Kafka.
  
  2. Assemble a data pipeline to catch these events: use Spark Streaming to filter select event types from Kafka and land them in HDFS as Parquet, making them available for analysis with Presto.
  
  3. Use Apache Bench to generate test data for my pipeline.
 

**Create a Docker Compose file:**

The first thing to do is create a `docker-compose.yml` file that defines all the containers needed to run the event tracking. The cluster is made up of the **zookeeper, kafka, cloudera, spark, presto and mids** containers. For the full contents and port configuration, please refer to `docker-compose.yml`.

**Here is an example of one of the containers**

In [None]:
presto:
    image: midsw205/presto:0.0.1
    hostname: presto
    volumes:
      - ~/w205:/w205
    expose:
      - "8080"
    environment:
      HIVE_THRIFTSERVER: cloudera:9083
    extra_hosts:
      - "moby:127.0.0.1"

**Spin up cluster:**

In [None]:
docker-compose up -d

This spins up the cluster. `docker-compose up` starts every container defined in `docker-compose.yml` and aggregates their output; `-d` starts the containers in the background and leaves them running.

**Create a topic:**

In [None]:
docker-compose exec kafka kafka-topics --create --topic events --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181

This command creates a Kafka topic. `exec` runs a command inside a container that is already running, and the first `kafka` names the service to run it in. `kafka-topics --create --topic events` creates a topic called `events`. `--partitions` sets how many partitions the topic is split into, which lets a topic be parallelized by spreading its data across multiple brokers. We only need one partition per the project requirements, so we use `--partitions 1`. `--replication-factor 1` sets how many copies of each partition are kept; replication is implemented at the partition level, and since our cluster runs a single Kafka broker, we set it to 1. `--if-not-exists` makes the command create the topic only if it does not already exist, which avoids errors and warnings on reruns. `--zookeeper zookeeper:32181` tells the command to connect to ZooKeeper on host `zookeeper`, port `32181`.

**Create a web-based application:**

In [None]:
import json
from kafka import KafkaProducer
from flask import Flask, request

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers='kafka:29092')


def log_to_kafka(topic, event):
    event.update(request.headers)
    producer.send(topic, json.dumps(event).encode())
    
@app.route("/purchase_a_sword")
def purchase_a_sword():
    purchase_sword_event = {'event_type': 'purchase_sword'}
    log_to_kafka('events', purchase_sword_event)
    return "Sword Purchased!\n"

The web app is called `game_api.py` in the folder. The code above shows part of the implementation. `game_api.py` handles three main events. To trigger one, the `mobile app` makes an `API` call to the `web-based API server` with any of the following calls:

   1. `default response:`
        This returns the default response "This is the default response".
    
   2. `purchase_a_sword:`
        This API is called when the user wants to purchase a sword. It returns "Sword Purchased!".
    
   3. `join_a_guild:`
       This is called when a user wants to join a guild. It returns "Joined a Guild".
   
In creating our web-based application, we import the `Flask` class and create an instance of it with `app = Flask(__name__)`. We also import `KafkaProducer` to write to Kafka, using the `bootstrap_servers` argument to connect to `kafka:29092`. Next, we define a function `log_to_kafka` that adds the request headers to the event, dumps the event to `json`, encodes it with `encode()` (UTF-8 by default), and uses `producer.send` to send it to the Kafka topic. The `route()` decorator tells `Flask` which `url` should trigger the function.
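To make the serialization step concrete, here is a minimal sketch of what `log_to_kafka` does to an event before it is sent. It uses only the standard library and no Kafka connection. The helper name `build_kafka_payload` and the header values are assumptions for illustration and are not part of `game_api.py`; unlike `log_to_kafka`, the sketch copies the event instead of updating it in place.

```python
import json


def build_kafka_payload(event, headers):
    """Merge request headers into the event and serialize it for Kafka."""
    enriched = dict(event)      # copy so the caller's dict is not mutated
    enriched.update(headers)    # add header fields such as Host and Accept
    return json.dumps(enriched).encode()  # str.encode() defaults to UTF-8


# Hypothetical header values, chosen only for illustration.
payload = build_kafka_payload(
    {"event_type": "purchase_sword"},
    {"Host": "user1.comcast.com", "Accept": "*/*"},
)
print(payload.decode())
```

The result is a `bytes` object, which is the form `producer.send` expects when no value serializer is configured on the producer.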

### Streaming Setup

The first step is to get our setup ready to stream events. To do this, we use two files:

   _1. ab.sh_
   
   _2. guild_sword_stream.py_

### ab.sh:

The `ab.sh` file uses the `mids` container, which provides Apache Bench (`ab`), to generate test data. In the code example below, `ab` sends `150` sword-purchase requests and `10` join-guild requests as `user1` to `localhost:5000`. The `.sh` script controls how many events of each type are sent while the stream is running.

In [None]:
docker-compose exec mids ab -n 150 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_sword
docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/join_guild

### guild_sword_stream.py

The `guild_sword_stream.py` script defines the event schema, selects the events of interest, uses Spark to extract their `json` fields, and writes the result to `HDFS`. Let's look at it in detail to understand how the different pieces contribute to the overall streaming process.



The extracted events are processed in streaming mode as requests come in. Because our client is only interested in `sword purchases and join guild`, only these two event types pass the filter. The filtered events are `cast` to string: the `CAST()` function converts a value (of any type) into a specified data type, and here it converts the Kafka message `value` column to a string. Finally, we `write` the streamed events to `HDFS`, triggering a write every `10 seconds` of processing time.
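The steps described above can be sketched as follows. This is not the contents of `guild_sword_stream.py`, only a minimal illustration under stated assumptions: the guild event type is assumed to be `join_guild` (only `purchase_sword` appears in the `game_api.py` excerpt), the schema fields and the output and checkpoint paths are hypothetical, and the names `is_tracked_event` and `start_stream` are invented for this sketch.

```python
import json

# Assumption: the guild route emits 'join_guild'; only 'purchase_sword'
# is confirmed by the game_api.py excerpt.
TRACKED_EVENT_TYPES = {"purchase_sword", "join_guild"}


def is_tracked_event(event_as_json):
    """Return True only for sword-purchase and guild-join events."""
    try:
        event = json.loads(event_as_json)
    except (TypeError, ValueError):
        return False  # missing or malformed payloads are dropped
    return isinstance(event, dict) and event.get("event_type") in TRACKED_EVENT_TYPES


def start_stream(spark,
                 output_path="/tmp/sword_guild_events",                   # hypothetical
                 checkpoint_path="/tmp/checkpoints_sword_guild_events"):  # hypothetical
    """Read from Kafka, keep tracked events, and write Parquet every 10 seconds."""
    # Imported here so the predicate above can be used without PySpark installed.
    from pyspark.sql.functions import col, from_json, udf
    from pyspark.sql.types import BooleanType, StringType, StructField, StructType

    # Assumed schema: the event type plus two of the request headers.
    schema = StructType([
        StructField("Host", StringType(), True),
        StructField("User-Agent", StringType(), True),
        StructField("event_type", StringType(), True),
    ])
    tracked = udf(is_tracked_event, BooleanType())

    raw_events = (spark.readStream
                  .format("kafka")
                  .option("kafka.bootstrap.servers", "kafka:29092")
                  .option("subscribe", "events")
                  .load())

    events = (raw_events
              .select(col("value").cast("string").alias("raw_event"))  # binary -> string
              .filter(tracked(col("raw_event")))
              .select(from_json(col("raw_event"), schema).alias("json"))
              .select("json.*"))

    return (events.writeStream
            .format("parquet")
            .option("checkpointLocation", checkpoint_path)
            .option("path", output_path)
            .trigger(processingTime="10 seconds")
            .start())
```

With an existing `SparkSession` named `spark`, calling `start_stream(spark).awaitTermination()` would keep the job running until it is stopped. The filter is kept as a plain Python function so it can be checked on sample strings without a cluster.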