# Week 5: Data Ingestion (Kafka)


![](https://camo.githubusercontent.com/56166d361c3975dee750ecce16d605bbbf66516b/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f352f35332f4170616368655f6b61666b615f776f7264747970652e737667)

### Student ID: [#####]
### Subtasks Done: [#,#,..]

# Working with sensor data

We want to monitor the status of three smart buildings.
Each building has 8 floors and each floor has 20 rooms, each with a maximum capacity of 10 people.

Rooms are equipped with sensors that count how many people are currently inside.

Due to COVID-19, we want to monitor how many people are in the various rooms, floors, and buildings.
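
To get a feel for the scale of the problem, the layout described above can be worked out in a few lines. This is only a back-of-the-envelope sketch; the constant names are illustrative and not part of the exercise.

```python
# Layout taken from the description above; the names are illustrative.
BUILDINGS = 3
FLOORS_PER_BUILDING = 8
ROOMS_PER_FLOOR = 20
MAX_PEOPLE_PER_ROOM = 10

total_floors = BUILDINGS * FLOORS_PER_BUILDING
total_rooms = total_floors * ROOMS_PER_FLOOR
max_people = total_rooms * MAX_PEOPLE_PER_ROOM

print(total_floors, total_rooms, max_people)  # 24 480 4800
```

The number of distinct floors (24) is worth keeping in mind when you decide how to key and partition the data.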

![](./buildings.png)

# Notes before starting!!!
 
- Look at the whole notebook (all tasks) before starting.
    - (Hint: tasks 2 and 3 depend on tasks 0 and 1)
- You can create as many topics as you want.
    - (Hint: 3 or more)
- Each topic in the exercise should have **at least** 2 partitions.
    - (Hint: to decide how many partitions, look at task 3)
- We assume a replication factor of 1 is sufficient for all topics.
- The minimal required dependencies have already been imported.

## Task 0: Setting the environment

In [1]:
from confluent_kafka import SerializingProducer, DeserializingConsumer
from confluent_kafka.serialization import StringSerializer, StringDeserializer
from confluent_kafka.serialization import IntegerSerializer, IntegerDeserializer
from confluent_kafka.admin import AdminClient, NewTopic, NewPartitions
from uuid import uuid4
import sys, lorem, random, time, json, csv

brokers = "kafka1:9092,kafka2:9093"
topics = ["ingestion2", "by_floor", "by_building"] ## Add here your topics

####  Create new topics

In [4]:
new_topics = [NewTopic(topic, num_partitions=3, replication_factor=1) for topic in topics]
a = AdminClient({'bootstrap.servers': brokers})

a.create_topics(new_topics)

{'ingestion': <Future at 0x7f71d43e1550 state=running>,
 'by_floor': <Future at 0x7f71d43e1460 state=running>,
 'by_building': <Future at 0x7f71d43e1670 state=running>}
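
The futures printed above are still in the `running` state, so the topics may not exist yet when the next cell runs. A minimal sketch of waiting for them follows; `wait_for_topics` is a hypothetical helper, and the stand-in futures at the bottom only exist so the snippet runs without a broker.

```python
from concurrent.futures import Future


def wait_for_topics(futures, timeout=10.0):
    """Block on each per-topic future; return {topic: error or None}."""
    outcome = {}
    for topic, fut in futures.items():
        try:
            fut.result(timeout=timeout)  # returns None on success
            outcome[topic] = None
        except Exception as exc:  # for example, the topic already exists
            outcome[topic] = exc
    return outcome


# Stand-in futures so the sketch runs without a Kafka cluster.
ok, failed = Future(), Future()
ok.set_result(None)
failed.set_exception(RuntimeError("topic already exists"))
result = wait_for_topics({"by_floor": ok, "by_building": failed})
print(result)
```

In the notebook you would pass the dictionary returned by `a.create_topics(new_topics)` instead of the stand-ins.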

In [8]:
new_parts = [NewPartitions(topics[0], 24)]  # grow the first topic to 24 partitions in total
fs = a.create_partitions(new_parts, validate_only=False)

In [21]:
for t in a.list_topics().topics.values():
    print(t)
    print(t.partitions.values())

buildings
dict_values([PartitionMetadata(0), PartitionMetadata(1), PartitionMetadata(2)])
t4
dict_values([PartitionMetadata(0)])
by_floor
dict_values([PartitionMetadata(0), PartitionMetadata(1), PartitionMetadata(2)])
by_building
dict_values([PartitionMetadata(0), PartitionMetadata(1), PartitionMetadata(2)])
_schemas
dict_values([PartitionMetadata(0)])
t3
dict_values([PartitionMetadata(0), PartitionMetadata(1)])
floors1
dict_values([PartitionMetadata(0)])
rooms
dict_values([PartitionMetadata(0), PartitionMetadata(1), PartitionMetadata(2)])
__consumer_offsets
dict_values([PartitionMetadata(0), PartitionMetadata(1), PartitionMetadata(2), PartitionMetadata(3), PartitionMetadata(4), PartitionMetadata(5), PartitionMetadata(6), PartitionMetadata(7), PartitionMetadata(8), PartitionMetadata(9), PartitionMetadata(10), PartitionMetadata(11), PartitionMetadata(12), PartitionMetadata(13), PartitionMetadata(14), PartitionMetadata(15), PartitionMetadata(16), PartitionMetadata(17), PartitionMetadata(

## Task 1: Counting People

Write a Kafka Producer that generates the observations every 5 seconds (system time)
for each building, floor, and room, and pushes them to a topic.

We recommend "murmur2_random" as partitioner.
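
With `murmur2_random`, messages that carry a key are assigned to a partition by hashing the key, while messages without a key are spread randomly. The sketch below illustrates only the general idea of key-based partitioning; it uses CRC32 as a stand-in hash, so the partition numbers will not match what Kafka actually computes. `pick_partition` is a hypothetical helper.

```python
import zlib


def pick_partition(key, num_partitions):
    """Map a key to a partition by hashing (CRC32 stand-in, not murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# The same key always lands on the same partition.
first = pick_partition("building-1/floor-3", 3)
second = pick_partition("building-1/floor-3", 3)
print(first == second)  # True
```

The practical consequence is that all messages sharing a key end up in the same partition, which matters when you later aggregate by that key.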

#### Populate the topics with 1000 observations

In [5]:
pconf = {
    'bootstrap.servers': brokers,
    'partitioner': 'murmur2_random',
    'key.serializer': StringSerializer('utf_8'),
    'value.serializer': StringSerializer('utf_8')
}

In [6]:
p = SerializingProducer(pconf)

#### Populate a topic with the 1000 observations in obs.csv, sending one every 5 seconds (system time)

#### Hints:
- Represent the message as JSON
- Use a random key (check the JSON)
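
As a rough illustration of the first hint, one CSV row can be turned into a JSON key and a JSON value as sketched below. The sample row and its column order (`uuid`, `count`, `building`, `floor`, `room`) are assumptions made for this example, and the real `obs.csv` may differ. Which fields go into the key is your design decision; building and floor are used here purely as an example.

```python
import csv
import io
import json

# Hypothetical sample data; the column order is an assumption.
sample = io.StringIO("uuid,count,building,floor,room\nabc-123,7,1,3,12\n")
reader = csv.reader(sample)
next(reader)          # skip the header row
row = next(reader)    # ['abc-123', '7', '1', '3', '12']

key = json.dumps({"building": row[2], "floor": row[3]})
value = json.dumps({"count": int(row[1]), "uuid": row[0], "room": row[4]})

print(key)
print(value)
```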

In [7]:
f = open('obs.csv', 'r')

In [8]:
with f:
    reader = csv.reader(f)


In [None]:
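with open('obs.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # skip the header row
    try:
        for row in reader:
            # Key: building and floor; value: count, observation id, and room
            k = { 'building': row[2], 'floor': row[3] }
            v = { 'count': row[1], 'uuid': row[0], 'room': row[4]}
            try:
                p.produce(topics[0], key=json.dumps(k), value=json.dumps(v))
            except BufferError:
                sys.stderr.write('%% Local producer queue is full (%d messages awaiting delivery): try again\n' % len(p))
            p.poll(0)   # serve delivery callbacks
            p.flush()   # wait until the message has been delivered
            time.sleep(5)
    except KeyboardInterrupt:
        # Catching the interrupt outside the loop stops the producer instead of skipping one row
        sys.stderr.write('%% Aborted by user\n')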

%% Aborted by user


## Task 2: Reading observations

Write a Kafka Consumer that reads the previous topic and prints the result out.
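
The usual shape of such a consumer is a poll loop. The sketch below assumes only that the consumer object has a `poll(timeout)` method returning either `None` or a message with `key()` and `value()` methods, which is how `DeserializingConsumer` behaves. `consume_n`, `FakeMessage`, and `FakeConsumer` are hypothetical names; the fakes are there so the loop can be tried without a broker.

```python
def consume_n(consumer, n, timeout=1.0, max_empty_polls=10):
    """Poll until n messages were read or the topic stays quiet."""
    records, empty = [], 0
    while len(records) < n and empty < max_empty_polls:
        msg = consumer.poll(timeout)
        if msg is None:  # nothing arrived within the timeout
            empty += 1
            continue
        empty = 0
        records.append((msg.key(), msg.value()))
    return records


class FakeMessage:
    def __init__(self, key, value):
        self._key, self._value = key, value

    def key(self):
        return self._key

    def value(self):
        return self._value


class FakeConsumer:
    def __init__(self, messages):
        self._messages = list(messages)

    def poll(self, timeout):
        return self._messages.pop(0) if self._messages else None


fake = FakeConsumer([FakeMessage("k1", "v1"), FakeMessage("k2", "v2")])
records = consume_n(fake, 5, timeout=0, max_empty_polls=2)
print(records)  # [('k1', 'v1'), ('k2', 'v2')]
```

With a real consumer you would also subscribe to the topic first and set at least `group.id`, `key.deserializer`, and `value.deserializer` in the configuration.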

In [None]:
consumer_conf = {
    'bootstrap.servers': brokers,
    ## Your Configuration Code Here
}

In [None]:
## Your CONSUMER Code Here

#### Consume 1000 observations

In [None]:
try:
    for i in range(0,1000):
        ## Your consuming Code Here
except KeyboardInterrupt:
    sys.stderr.write('%% Aborted by user\n')
finally:
    # Close down consumer to commit final offsets.
    consumer.close()

## Task 3: Aggregating the number of people 

Write a Kafka Consumer that reads the previous topics and counts
the number of people per floor and per building every minute.

Always ensure the results are durable (save them in a topic).

Carry the minimal amount of information in the key and the value (remove unnecessary information).




##### HINT: How did you organize the data in partitions?

#### Change the message key to simplify counting by floor.
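
One possible way to do this is sketched below. It assumes the incoming key is a JSON object with `building` and `floor` fields and the incoming value is a JSON object with a `count` field; `rekey` and the `building-floor` key format are hypothetical choices, not a required design.

```python
import json


def rekey(key_json, value_json):
    """Return (new_key, new_value): a flat floor id and just the count."""
    key = json.loads(key_json)
    value = json.loads(value_json)
    new_key = f"{key['building']}-{key['floor']}"  # unique per floor
    new_value = int(value["count"])                # drop uuid and room
    return new_key, new_value


new_key, new_value = rekey(
    '{"building": "1", "floor": "3"}',
    '{"count": "7", "uuid": "abc-123", "room": "12"}',
)
print(new_key, new_value)  # 1-3 7
```

A flat string key keeps the messages small and makes every observation of the same floor hash to the same partition.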

In [None]:
consumer_conf = {
    'bootstrap.servers': brokers,
    ## Your Configuration Code Here
}

In [None]:
## Your consumer Code Here

In [None]:
pconf = {
    'bootstrap.servers': brokers,
       ## Your Configuration Code Here
}

In [None]:
## Your producer Code Here

In [None]:
try:
    for i in range(0,1000):
          # Your consuming and producing code here
except KeyboardInterrupt:
    sys.stderr.write('%% Aborted by user\n')

### Total number of people per floor

Keep a local count of the people on each floor. Floors are uniquely identified by building and floor number.
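
A minimal sketch of such a local count is shown below. It assumes each message carries a floor identifier (here a hypothetical `building-floor` string) and an integer count, and that every observation should be added exactly once. If rooms report repeatedly, you would instead need to keep only the latest reading per room before summing.

```python
def add_observation(floors, floor_key, count):
    """Add one observation to the running total of its floor."""
    floors[floor_key] = floors.get(floor_key, 0) + count
    return floors


floors = {}
for floor_key, count in [("1-3", 7), ("1-3", 2), ("2-1", 4)]:
    add_observation(floors, floor_key, count)

print(floors)  # {'1-3': 9, '2-1': 4}
```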

In [None]:
consumer_conf = {
    'bootstrap.servers': brokers,
    ## Your configuration code here
}

In [None]:
## Your Consumer Code Here

### Use the following dictionary to aggregate the results

In [None]:
floors = {}

In [None]:
try:
    for i in range(0,1000):
            # Your consuming code here
except KeyboardInterrupt:
    sys.stderr.write('%% Aborted by user\n')

## Let's visualize the results

In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.bar(floors.keys(), floors.values(), color='g')

##  Let's save the aggregated result in a topic and progress from there.

In [None]:
pconf = {
    'bootstrap.servers': brokers,
       ## Your Configuration Code Here
}

In [None]:
## Your producer Code Here 

In [None]:
try:
    for f in floors.keys():
       ## Your producing Code Here 
except KeyboardInterrupt:
    sys.stderr.write('%% Aborted by user\n')

## Total number of people per building
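
Building totals can be derived from the per-floor results. The sketch below assumes the floor keys are strings of the form `building-floor`, which is a hypothetical format; adapt the parsing to whatever key you chose. It also produces the `buildings` dictionary that the plotting cell further down expects.

```python
def rollup_by_building(floors):
    """Sum per-floor counts into per-building counts."""
    buildings = {}
    for floor_key, count in floors.items():
        building, _, _floor = floor_key.partition("-")
        buildings[building] = buildings.get(building, 0) + count
    return buildings


buildings = rollup_by_building({"1-3": 9, "1-4": 1, "2-1": 4})
print(buildings)  # {'1': 10, '2': 4}
```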

In [None]:
consumer_conf = {
    'bootstrap.servers': brokers,
   ## Your configuration Code Here
}

In [None]:
## Your consumer Code Here

In [None]:
try:
    for i in range(0,1000):
         ## Your consuming Code Here
except KeyboardInterrupt:
    sys.stderr.write('%% Aborted by user\n')

## Let's visualize the results

In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.bar(buildings.keys(), buildings.values(), color='g')

### Draw the dataflow between the topics using a tool of your choice

![](http://placehold.it/256x256)

## Optional Tasks (but useful for preparing for the final exam)

## Task 4: Add a 1-minute window to the aggregation (see wordcount example)
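
One common way to build such a window is a tumbling window: each timestamp is rounded down to the start of its minute, and counts are grouped by window start and key. The sketch below is one possible approach and not necessarily the one used in the wordcount example; the tuple layout `(timestamp_in_seconds, floor_key, count)` and the function names are assumptions.

```python
def window_start(timestamp_s, size_s=60):
    """Round a timestamp (in seconds) down to the start of its window."""
    return int(timestamp_s // size_s) * size_s


def count_per_window(observations, size_s=60):
    """Group (timestamp, floor_key, count) tuples into tumbling windows."""
    windows = {}
    for ts, floor_key, count in observations:
        k = (window_start(ts, size_s), floor_key)
        windows[k] = windows.get(k, 0) + count
    return windows


obs = [(0, "1-3", 2), (59, "1-3", 3), (60, "1-3", 5)]
windows = count_per_window(obs)
print(windows)  # {(0, '1-3'): 5, (60, '1-3'): 5}
```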

## Task 5: Redo Tasks 0-3, modelling the observations using AVRO.
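
As a starting point, an observation could be described by an Avro record schema like the one sketched below. The record name, namespace, field names, and field types are assumptions based on the fields used in the JSON messages; adjust them to your own model.

```python
import json

# Hypothetical Avro schema for one sensor observation.
observation_schema = {
    "type": "record",
    "name": "Observation",
    "namespace": "example.sensors",
    "fields": [
        {"name": "uuid", "type": "string"},
        {"name": "building", "type": "string"},
        {"name": "floor", "type": "string"},
        {"name": "room", "type": "string"},
        {"name": "count", "type": "int"},
    ],
}

# Avro tooling generally expects the schema as a JSON string.
schema_str = json.dumps(observation_schema)
print(schema_str)
```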