When it comes to projects, I'm a firm believer that you should put yourself in weird situations. You might run into something similar at work one day, so just wing it.

Now that we have the parquet file, I have no idea how to stream it. PySpark can stream directly to Kafka, but it does so with no regard to the time column. So ... maybe use a database and run a query like `SELECT * FROM dataframe WHERE Stream_Time BETWEEN last_run AND NOW()`, where `last_run` is a variable holding the value of `NOW()` from the previous run, and repeat this query until `last_run` is >= `RUNTIME`.
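One detail worth pinning down before building on this idea: `BETWEEN` is inclusive on both ends, so two consecutive windows share their boundary value. Here is a minimal sketch on an in-memory SQLite table (the table contents and the helper names `window_between` and `window_half_open` are made up for illustration) that shows the effect next to a half-open alternative:

```python
import sqlite3

# Hypothetical toy data: one row per Stream_Time value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream (Stream_Time REAL, payload TEXT)")
conn.executemany(
    "INSERT INTO stream VALUES (?, ?)",
    [(0.0, "a"), (1.0, "b"), (2.0, "c"), (3.0, "d")],
)

def window_between(lo, hi):
    """Rows with lo <= Stream_Time <= hi (BETWEEN includes both ends)."""
    sql = ("SELECT payload FROM stream "
           "WHERE Stream_Time BETWEEN ? AND ? ORDER BY Stream_Time")
    return [r[0] for r in conn.execute(sql, (lo, hi))]

def window_half_open(lo, hi):
    """Rows with lo <= Stream_Time < hi, so adjacent windows never overlap."""
    sql = ("SELECT payload FROM stream "
           "WHERE Stream_Time >= ? AND Stream_Time < ? ORDER BY Stream_Time")
    return [r[0] for r in conn.execute(sql, (lo, hi))]

# Two consecutive polls whose boundary lands exactly on a stored timestamp.
inclusive = window_between(0, 2) + window_between(2, 4)
half_open = window_half_open(0, 2) + window_half_open(2, 4)
print(inclusive)  # row "c" shows up in both windows
print(half_open)  # each row shows up once
```

With float timestamps taken from a clock, an exact hit on the boundary is unlikely, so this is more of a correctness nicety than a practical problem here. It matters more if `Stream_Time` holds whole seconds.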

Sounds like a good idea.

How should we get that parquet file into a database? Well, PySpark can write over JDBC (if you don't know what that is, think of it as a library that lets Java applications connect to databases). However, it kept raising an out-of-memory error. Therefore, we will do it the old way.

Pandas can write directly to a sqlite3 database, so let's try that and see if the performance is acceptable.

In [1]:
import pandas as pd
%timeit pd.read_parquet('./parquet/stream_df.parquet')

10.3 s ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [2]:
df = pd.read_parquet('./parquet/stream_df.parquet')

In [3]:
import sqlite3 
conn = sqlite3.connect('./stream_df.db')
cur = conn.cursor()

In [4]:
df.to_sql(name='stream', con=conn, if_exists='replace')

That's acceptable. Now let's start a Kafka producer and start the stream. The Kafka cluster is already up and running. [Kafka](https://kafka.apache.org/quickstart)'s official website provides a quickstart guide on how to start the cluster and create a topic. Those steps will be enough for now.

But before doing that we need to test the speed of our database.

In [5]:
%timeit cur.execute('SELECT * FROM stream WHERE Stream_Time BETWEEN 0 AND 8')

16.7 µs ± 234 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


Pretty fast. Now let's start.
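A side note on that measurement: `cur.execute` on a `sqlite3` cursor does not pull the whole result set. Rows are only read by `fetchall()` or by iterating, so the number above says little about how long a full window takes to read. Also, as far as I know, `to_sql` only indexes the DataFrame's own index column, so unless `Stream_Time` happens to be that index, the range filter has to scan the table. The sketch below uses a made-up in-memory table (the index name `idx_stream_time` and the helper `query_plan` are my own choices) to add an index, ask SQLite for its query plan, and time the query including the fetch:

```python
import sqlite3
import time

# Hypothetical stand-in for the real table: 100,000 rows, 10 ms apart.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream (Stream_Time REAL, payload TEXT)")
conn.executemany(
    "INSERT INTO stream VALUES (?, ?)",
    ((i * 0.01, f"row-{i}") for i in range(100_000)),
)

QUERY = "SELECT * FROM stream WHERE Stream_Time BETWEEN ? AND ?"

def query_plan(connection):
    """Return SQLite's plan description for QUERY as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (0, 8)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

plan_before = query_plan(conn)
conn.execute("CREATE INDEX idx_stream_time ON stream (Stream_Time)")
plan_after = query_plan(conn)
print("before:", plan_before)
print("after: ", plan_after)

# Time the query including the fetch, which is what the producer loop pays for.
start = time.perf_counter()
rows = conn.execute(QUERY, (0, 8)).fetchall()
elapsed = time.perf_counter() - start
print(f"fetched {len(rows)} rows in {elapsed * 1e3:.2f} ms")
```

On the real database the same `CREATE INDEX` statement could be run once through `cur` right after `to_sql`. Whether it pays off depends on the table size, so treat it as something to measure, not a given.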

In [7]:
from kafka import KafkaProducer
from config import TOPIC, SERVER_PORT, RUNTIME
import time
import pickle

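def serializer(rows):
    # Pickle the fetched rows into bytes for Kafka.
    return pickle.dumps(rows)

producer = KafkaProducer(bootstrap_servers=[SERVER_PORT],
                         value_serializer=serializer)
start_time = time.time()
last_run = 0
# Replay the table in real time: each pass sends the rows whose
# Stream_Time falls between the previous pass and now.
while last_run <= RUNTIME:
    now = time.time() - start_time  # seconds elapsed since the stream started
    cur.execute('SELECT * FROM stream WHERE Stream_Time BETWEEN ? AND ?',
                (last_run, now))
    last_run = now
    response = cur.fetchall()
    if response:  # skip empty windows
        producer.send(TOPIC, response)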

KeyboardInterrupt: 

Our Kafka console consumer received the data. Our job here is done.
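If this loop were to be reused, a few refinements might be worth trying. The sketch below is one possible variant, not what was run above: it sleeps between polls instead of spinning, uses half-open windows so a boundary row is never sent twice, and serializes to JSON so non-Python consumers can read the messages. The names `replay`, `json_serializer`, and `poll_interval` are hypothetical, and a plain list stands in for the Kafka producer so the example runs without a cluster.

```python
import json
import sqlite3
import time

def json_serializer(rows):
    """Encode a list of row tuples as UTF-8 JSON bytes."""
    return json.dumps(rows).encode("utf-8")

def replay(conn, send, runtime, poll_interval=0.05):
    """Replay rows of `stream` in real time, calling send(rows) per non-empty window.

    Windows are half-open, [last_run, now), so no row is sent twice.
    """
    cur = conn.cursor()
    start = time.monotonic()
    last_run = 0.0
    while last_run < runtime:
        # Clamp to runtime so the final window ends exactly there.
        now = min(time.monotonic() - start, runtime)
        cur.execute(
            "SELECT * FROM stream WHERE Stream_Time >= ? AND Stream_Time < ? "
            "ORDER BY Stream_Time",
            (last_run, now),
        )
        rows = cur.fetchall()
        if rows:
            send(rows)
        last_run = now
        time.sleep(poll_interval)  # avoid spinning the CPU between polls

# Demo with a toy table and a list standing in for the Kafka producer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream (Stream_Time REAL, payload TEXT)")
conn.executemany(
    "INSERT INTO stream VALUES (?, ?)",
    [(0.0, "a"), (0.05, "b"), (0.1, "c"), (0.2, "d"), (10.0, "late")],
)
sent = []
replay(conn, sent.append, runtime=0.3)
payloads = [row[1] for batch in sent for row in batch]
print(payloads)
print(json_serializer(sent[0]))
```

With kafka-python, `send` could be something like `lambda rows: producer.send(TOPIC, rows)` on a producer created with `value_serializer=json_serializer`, followed by `producer.flush()` once the loop ends so buffered messages are delivered before the script exits.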