<a href="https://colab.research.google.com/github/Rogerio-mack/data-engineering/blob/main/Aula_kafka_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kafka and Spark Streaming in Colab

> Kafka $\longrightarrow$ Spark $\longrightarrow$ ML; not everything is implemented here, but you can try...


In [1]:
!pip install kafka-python

Collecting kafka-python
  Downloading kafka_python-2.0.2-py2.py3-none-any.whl.metadata (7.8 kB)
Downloading kafka_python-2.0.2-py2.py3-none-any.whl (246 kB)
Installing collected packages: kafka-python
Successfully installed kafka-python-2.0.2


### Import packages

In [2]:
import os
from datetime import datetime
import time
import threading
import json
from kafka import KafkaProducer
from kafka.errors import KafkaError
import pandas as pd
from sklearn.model_selection import train_test_split

# Download and Setup: Kafka and Zookeeper

This is an older version, but the same steps should work with newer versions... you can try.



In [3]:
!curl -sSOL https://archive.apache.org/dist/kafka/2.7.0/kafka_2.13-2.7.0.tgz
!tar -xzf kafka_2.13-2.7.0.tgz

Local setup:

- Kafka broker: 127.0.0.1:9092
- Zookeeper node: 127.0.0.1:2181

In [4]:
!./kafka_2.13-2.7.0/bin/zookeeper-server-start.sh -daemon ./kafka_2.13-2.7.0/config/zookeeper.properties
!./kafka_2.13-2.7.0/bin/kafka-server-start.sh -daemon ./kafka_2.13-2.7.0/config/server.properties
!echo "Waiting for some seconds until kafka and zookeeper services are up and running"
!sleep 15

Waiting for some seconds until kafka and zookeeper services are up and running
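
The fixed `sleep 15` above can be replaced by polling the two ports until they accept connections. The following is a minimal sketch, not part of the original notebook: `wait_for_port` is a hypothetical helper name, and an open TCP port only suggests that the service has started, not that the broker has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Addresses taken from the local setup above.
SERVICES = {"zookeeper": ("127.0.0.1", 2181), "kafka": ("127.0.0.1", 9092)}

# Example call in the notebook (commented out so the cell has no side effects):
# for name, (host, port) in SERVICES.items():
#     print(name, "ready" if wait_for_port(host, port) else "not reachable")
```

Polling returns as soon as both services listen, so the cell does not wait longer than needed and reports a clear failure if a service never comes up.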


# Create Topics

- train: partitions=2, replication-factor=1
- test: partitions=1, replication-factor=1

**Note** You cannot set replication-factor=2 here, because there is only a single server (broker). Up to **100 nodes** can be configured.
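
The same topic layout can also be created from Python. The sketch below is an assumption-laden alternative to the shell commands, not the notebook's method: `TOPICS`, `check_replication` and `create_topics` are hypothetical names, and the admin call uses `KafkaAdminClient` and `NewTopic` from kafka-python. The check mirrors the note above: a replication factor larger than the number of brokers cannot be satisfied.

```python
# Topic layout from the list above: name -> (partitions, replication factor).
TOPICS = {"train": (2, 1), "test": (1, 1)}


def check_replication(topics, num_brokers):
    """Return the topic names whose replication factor exceeds the broker count."""
    return [name for name, (_, rf) in topics.items() if rf > num_brokers]


def create_topics(topics, bootstrap="127.0.0.1:9092"):
    """Create the topics through kafka-python's admin client (needs a running broker)."""
    # Imported lazily so check_replication works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    try:
        admin.create_topics(
            [NewTopic(name=n, num_partitions=p, replication_factor=rf)
             for n, (p, rf) in topics.items()]
        )
    finally:
        admin.close()


# With the single broker used here, every topic in TOPICS passes the check.
print(check_replication(TOPICS, num_brokers=1))
```

Calling `create_topics(TOPICS)` would only make sense after the broker is running and before the topics exist.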

In [13]:
%%script echo skipping
!./kafka_2.13-2.7.0/bin/kafka-topics.sh --delete --bootstrap-server 127.0.0.1:9092 --topic test
!./kafka_2.13-2.7.0/bin/kafka-topics.sh --delete --bootstrap-server 127.0.0.1:9092 --topic train

In [14]:
!./kafka_2.13-2.7.0/bin/kafka-topics.sh --create --bootstrap-server 127.0.0.1:9092 --replication-factor 1 --partitions 2 --topic train
!./kafka_2.13-2.7.0/bin/kafka-topics.sh --create --bootstrap-server 127.0.0.1:9092 --replication-factor 1 --partitions 1 --topic test


Created topic train.
Created topic test.


In [15]:
!./kafka_2.13-2.7.0/bin/kafka-topics.sh --describe --bootstrap-server 127.0.0.1:9092 --topic train
!./kafka_2.13-2.7.0/bin/kafka-topics.sh --describe --bootstrap-server 127.0.0.1:9092 --topic test

Topic: train	PartitionCount: 2	ReplicationFactor: 1	Configs: segment.bytes=1073741824
	Topic: train	Partition: 0	Leader: 0	Replicas: 0	Isr: 0
	Topic: train	Partition: 1	Leader: 0	Replicas: 0	Isr: 0
Topic: test	PartitionCount: 1	ReplicationFactor: 1	Configs: segment.bytes=1073741824
	Topic: test	Partition: 0	Leader: 0	Replicas: 0	Isr: 0


# Any dataset... `iris`


In [20]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

iris_df.head()



|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [21]:
train_df, test_df = train_test_split(iris_df, test_size=0.4, shuffle=True)
print("Number of training samples: ", len(train_df))
print("Number of testing samples: ", len(test_df))

# Separate the feature columns from the label column
x_train_df = train_df.drop(["target"], axis=1)
y_train_df = train_df["target"]

x_test_df = test_df.drop(["target"], axis=1)
y_test_df = test_df["target"]



Number of training samples:  90
Number of testing samples:  60


In [22]:
# The labels are used as Kafka message keys, so the data is stored across
# multiple partitions, which allows efficient data retrieval

x_train = list(filter(None, x_train_df.to_csv(index=False).split("\n")[1:]))
y_train = list(filter(None, y_train_df.to_csv(index=False).split("\n")[1:]))

x_test = list(filter(None, x_test_df.to_csv(index=False).split("\n")[1:]))
y_test = list(filter(None, y_test_df.to_csv(index=False).split("\n")[1:]))

NUM_COLUMNS = len(x_train_df.columns)
len(x_train), len(y_train), len(x_test), len(y_test)

(90, 90, 60, 60)
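
To see why the label keys matter, it helps to look at how a key selects a partition. The sketch below is illustrative only: Kafka's default partitioner hashes the key bytes with murmur2, while `toy_partition` (a hypothetical name) uses CRC32 as a stand-in, so the partition numbers it prints will not match the real broker. The point it shows is that every message with the same key lands in the same partition.

```python
import zlib
from collections import Counter


def toy_partition(key, num_partitions):
    """Stand-in partitioner: CRC32 of the key bytes modulo the partition count.

    Kafka's default partitioner uses a murmur2 hash instead, so the actual
    partition numbers will differ; only the key -> partition idea is the same.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys, num_partitions):
    """Count how many messages each partition would receive."""
    return Counter(toy_partition(k, num_partitions) for k in keys)


# Hypothetical label keys, shaped like y_train above (strings "0", "1", "2").
sample_keys = ["0", "1", "2", "0", "1", "2", "0"]
print(partition_counts(sample_keys, num_partitions=2))
```

With three distinct labels and two partitions, at least two labels necessarily share a partition, so the split across partitions follows the label distribution rather than being perfectly even.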

# Store data in kafka


In [23]:
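def error_callback(exc):
    # Called by the producer when a send fails. It runs in the producer's
    # background thread, so the exception may only be logged there.
    raise Exception('Error while sending data to kafka: {0}'.format(str(exc)))

def write_to_kafka(topic_name, items):
    # items yields (message, key) pairs of strings
    count = 0
    producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'])
    for message, key in items:
        # Keys and values must be bytes; send() is asynchronous
        producer.send(topic_name, key=key.encode('utf-8'), value=message.encode('utf-8')).add_errback(error_callback)
        count += 1
    # Block until all buffered messages have been sent
    producer.flush()
    print("Wrote {0} messages into topic: {1}".format(count, topic_name))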

In [24]:
write_to_kafka("train", zip(x_train, y_train))
write_to_kafka("test", zip(x_test, y_test))

Wrote 90 messages into topic: train
Wrote 60 messages into topic: test


In [26]:
!/content/kafka_2.13-2.7.0/bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--topic train \
--from-beginning

4.6,3.1,1.5,0.2
6.7,2.5,5.8,1.8
6.3,2.9,5.6,1.8
6.4,2.7,5.3,1.9
7.1,3.0,5.9,2.1
7.3,2.9,6.3,1.8
5.0,3.2,1.2,0.2
6.5,3.2,5.1,2.0
6.0,3.0,4.8,1.8
6.3,2.8,5.1,1.5
4.9,2.5,4.5,1.7
5.0,3.0,1.6,0.2
6.2,2.8,4.8,1.8
5.8,2.8,5.1,2.4
6.7,3.3,5.7,2.5
5.1,3.8,1.6,0.2
6.3,3.3,6.0,2.5
6.1,2.6,5.6,1.4
7.7,3.0,6.1,2.3
7.9,3.8,6.4,2.0
5.7,3.8,1.7,0.3
5.2,3.4,1.4,0.2
5.8,2.7,5.1,1.9
5.1,3.8,1.9,0.4
4.9,3.1,1.5,0.1
4.7,3.2,1.3,0.2
4.4,3.2,1.3,0.2
7.7,3.8,6.7,2.2
6.5,3.0,5.5,1.8
5.7,4.4,1.5,0.4
4.8,3.4,1.9,0.2
4.8,3.4,1.6,0.2
6.4,2.8,5.6,2.1
5.0,3.6,1.4,0.2
7.6,3.0,6.6,2.1
4.3,3.0,1.1,0.1
4.6,3.2,1.4,0.2
5.1,3.5,1.4,0.2
5.8,4.0,1.2,0.2
4.8,3.0,1.4,0.1
6.7,3.1,5.6,2.4
4.9,3.1,1.5,0.2
7.2,3.0,5.8,1.6
5.5,4.2,1.4,0.2
5.8,2.7,5.1,1.9
6.4,3.2,5.3,2.3
5.0,3.4,1.6,0.4
7.4,2.8,6.1,1.9
5.1,3.8,1.5,0.3
6.3,2.5,5.0,1.9
5.4,3.4,1.7,0.2
5.9,3.0,5.1,1.8
5.0,3.4,1.5,0.2
4.8,3.0,1.4,0.3
6.9,3.2,5.7,2.3
5.0,3.3,1.4,0.2
6.9,3.1,5.1,2.3
5.0,3.5,1.6,0.6
4.9,3.0,1.4,0.2
6.7,3.3,5.7,2.1
5.2,3.5,1.5,0.2
4.9,3.6,1.4,0.1
6.0,2.2,
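
The console consumer above prints only the message values and keeps waiting for new messages until the cell is interrupted. As a hedged alternative, the sketch below reads a topic from Python with kafka-python's `KafkaConsumer` and recovers both the features (value) and the label (key). `decode_record` and `read_topic` are hypothetical helper names, and the 5-second `consumer_timeout_ms` is an arbitrary choice for ending the loop.

```python
def decode_record(key, value):
    """Turn one Kafka record (bytes key, bytes value) into (features, label)."""
    features = [float(x) for x in value.decode("utf-8").split(",")]
    label = int(key.decode("utf-8"))
    return features, label


def read_topic(topic, bootstrap="127.0.0.1:9092", timeout_ms=5000):
    """Read a topic from the beginning and return the decoded records."""
    # Lazy import: decode_record stays usable without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=[bootstrap],
        auto_offset_reset="earliest",    # start from the oldest message
        enable_auto_commit=False,        # do not store offsets for this read
        consumer_timeout_ms=timeout_ms,  # stop iterating when no message arrives
    )
    try:
        return [decode_record(msg.key, msg.value) for msg in consumer]
    finally:
        consumer.close()


# Decoding works on plain bytes, shaped like the messages written above.
print(decode_record(b"0", b"5.1,3.5,1.4,0.2"))
```

With the broker running, `read_topic("train")` would return a list of `(features, label)` pairs that can be loaded back into a DataFrame.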

# Next: Spark Streaming, ML...