When it comes to projects, I'm a firm believer that you should put yourself in weird situations: you may run into something similar at work one day, so just wing it.

Now that we have the parquet file, I have no idea how to stream it. PySpark can stream directly to Kafka, but it does so with no regard to the time column. So... maybe use a database and run a query like `SELECT * FROM dataframe WHERE Stream_Time BETWEEN last_run AND NOW()`, where `last_run` is a variable holding the value of `NOW()` from the previous run, and repeat this query until `last_run > RUNTIME`.

Sounds like a good idea.
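Here is a minimal sketch of that polling idea, just to check the logic before touching the real data. The tiny in-memory table and the simulated clock ticks are made up for illustration. One thing it assumes: since `BETWEEN` includes both endpoints, the sketch uses a half-open window `[last_run, now)` so a row sitting exactly on a boundary is not picked up twice.

```python
import sqlite3

# Hypothetical in-memory table standing in for the real `stream` table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE stream (ID INTEGER, Stream_Time REAL)")
conn.executemany(
    "INSERT INTO stream VALUES (?, ?)",
    [(i, t) for i, t in enumerate([0.0, 0.5, 1.0, 1.0, 2.5, 4.0])],
)

def poll(conn, last_run, now):
    """Return rows whose Stream_Time falls in the half-open window [last_run, now)."""
    cur = conn.execute(
        "SELECT * FROM stream WHERE Stream_Time >= ? AND Stream_Time < ?",
        (last_run, now),
    )
    return [dict(row) for row in cur.fetchall()]

# Simulated clock ticks instead of time.time(), so the run is deterministic.
ticks = [1.0, 2.0, 5.0]
last_run = 0.0
batches = []
for now in ticks:
    batches.append(poll(conn, last_run, now))
    last_run = now  # the next window starts where this one ended

sent_ids = [row["ID"] for batch in batches for row in batch]
print(sent_ids)
```

Every row ends up in exactly one batch, including the two rows that share `Stream_Time = 1.0`.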

How should we get that parquet file into a database? Well, PySpark can write over JDBC (if you don't know what that is, think of it as a library that lets Java applications connect to databases). However, sqlite3 doesn't work well with PySpark: it has a JDBC driver, but it's still not that great.

Pandas can write directly to a sqlite3 database, so let's try that and see if the performance is acceptable.

In [1]:
import pandas as pd
%timeit pd.read_parquet('./Data/parquet/stream_df.parquet')

4.61 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [2]:
df = pd.read_parquet('./Data/parquet/stream_df.parquet')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5690684 entries, 0 to 5690683
Data columns (total 49 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   Start                  object        
 1   Stream_Time            float64       
 2   ID                     int32         
 3   Severity               int8          
 4   Start_Time             datetime64[ns]
 5   End_Time               datetime64[ns]
 6   Start_Lat              float64       
 7   Start_Lng              float64       
 8   End_Lat                float64       
 9   End_Lng                float64       
 10  Distance(mi)           float64       
 11  Description            object        
 12  Number                 float32       
 13  Street                 object        
 14  Side                   object        
 15  City                   object        
 16  County                 object        
 17  State                  object        
 18  Zipcode               

In [3]:
import sqlite3 
conn = sqlite3.connect('./Data/stream_df.db')
# Let sqlite return sqlite3.Row objects (convertible to dicts) instead of plain tuples
conn.row_factory = sqlite3.Row 
cur = conn.cursor()

In [4]:
df.to_sql(name = 'stream', con = conn, if_exists = 'replace')

That's acceptable. Now let's start a Kafka producer and start the stream. The Kafka cluster is already up and running. The official [Kafka quickstart guide](https://kafka.apache.org/quickstart) covers how to start the cluster and create a topic. Those steps will be enough for now.

But before doing that we need to test the speed of our database.

In [5]:
%timeit cur.execute('SELECT * FROM stream WHERE Stream_Time BETWEEN 0 AND 8')

111 µs ± 572 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Pretty fast, although note that this only times `cur.execute`, not fetching the matching rows. Now let's start.
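To get a feel for the full round trip, a rough sketch like the following times the query together with `fetchall()`, once without and once with an index on `Stream_Time`. The table here is synthetic (200,000 made-up rows in memory), and the index name `idx_stream_time` is my own choice, so treat the numbers it prints as indicative only; they say nothing about the real 5.6 million row table.

```python
import sqlite3
import time

# Synthetic stand-in for the real table: hypothetical rows, one every 0.01 s.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream (ID INTEGER, Stream_Time REAL)")
conn.executemany(
    "INSERT INTO stream VALUES (?, ?)",
    ((i, i / 100) for i in range(200_000)),
)

QUERY = "SELECT * FROM stream WHERE Stream_Time BETWEEN 0 AND 8"

def timed_fetch(conn):
    """Run the query and fetch every row, returning (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(QUERY).fetchall()
    return rows, time.perf_counter() - start

rows_before, t_before = timed_fetch(conn)

# An index on the filter column lets sqlite avoid scanning the whole table.
conn.execute("CREATE INDEX idx_stream_time ON stream (Stream_Time)")
rows_after, t_after = timed_fetch(conn)

print(f"{len(rows_before)} rows; no index: {t_before:.4f}s, with index: {t_after:.4f}s")
```

On the real database the same index could be created once, right after `df.to_sql`, if the un-indexed query turns out to be too slow for the polling loop.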

In [None]:
from kafka import KafkaProducer
from config import SOURCE_TOPIC, SERVER_PORT, RUNTIME
import time
import json
import sqlite3 
conn = sqlite3.connect('./Data/stream_df.db')
# Let sqlite return sqlite3.Row objects (convertible to dicts) instead of plain tuples
conn.row_factory = sqlite3.Row 
cur = conn.cursor()
producer = KafkaProducer(bootstrap_servers=[SERVER_PORT],
                         value_serializer = lambda x: json.dumps(x).encode('utf-8'))

start_time = time.time()
last_run = 0
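while last_run <= RUNTIME:
    now = time.time() - start_time
    # Half-open window [last_run, now): BETWEEN is inclusive on both ends,
    # so a row sitting exactly on a boundary could otherwise be sent twice.
    cur.execute('SELECT * FROM stream WHERE Stream_Time >= ? AND Stream_Time < ?',
                (last_run, now))
    last_run = now

    # sqlite3 returns sqlite3.Row objects, need to call dict() on each object
    response = [dict(row) for row in cur.fetchall()]
    if response:
        producer.send(topic = SOURCE_TOPIC, value = response)

# send() is asynchronous; flush so buffered messages are delivered before exiting
producer.flush()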

Our Kafka console consumer received the data. Our job here is done, and we can safely Ctrl-C.
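For reference, this is roughly what each message looks like on the wire. The sketch below reuses the same JSON serializer the producer was configured with and decodes it back the way a consumer would have to. The two sample rows are invented, using a few column names from the dataframe; no Kafka connection is involved.

```python
import json

def serialize(value):
    """Same logic as the producer's value_serializer."""
    return json.dumps(value).encode("utf-8")

def deserialize(raw):
    """Inverse step a consumer would apply to the message bytes."""
    return json.loads(raw.decode("utf-8"))

# Hypothetical batch: one polling window's worth of rows.
batch = [
    {"ID": 1, "Stream_Time": 0.5, "City": "Dayton"},
    {"ID": 2, "Stream_Time": 0.7, "City": "Columbus"},
]

payload = serialize(batch)
decoded = deserialize(payload)
print(type(payload).__name__, len(decoded))
```

So one Kafka message carries a JSON array of rows (one polling window), not a single row, which is worth keeping in mind when writing the downstream job.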

We sent `value` without a `key` because keys only serve to route events with the same key to the same partition, which guarantees their relative order (useful for keeping state). Here, however, we don't really care about the order of the events.
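To make the role of keys a bit more concrete, here is a toy illustration of key-based partitioning. It is a simplification: `toy_partition` is a made-up helper that uses CRC32, which is not the hash Kafka's clients actually use, and the keys and payloads are invented. The only point is that hashing the key and taking it modulo the partition count always sends the same key to the same partition, in the order the events were produced.

```python
import zlib

def toy_partition(key: str, num_partitions: int) -> int:
    """Hypothetical partitioner: hash the key bytes, then take the remainder."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Invented keyed events, in production order.
events = [("CA", "start"), ("TX", "start"), ("CA", "end"), ("TX", "end")]

partitions = {}
for key, payload in events:
    partitions.setdefault(toy_partition(key, 3), []).append((key, payload))

for partition_id, contents in sorted(partitions.items()):
    print(partition_id, contents)
```

Without a key there is no such guarantee, which is fine for this stream.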

The next step is writing a Kafka Streams job to split the data into two other topics: one containing the beginning of each event and the other containing the end.

This concludes the streaming/ingestion layer. We will finalize the ingestion with Kafka Streams in the next notebook.