When it comes to projects, I'm a firm believer that you should put yourself in weird situations; you might run into something similar at work one day. So let's just wing it.

Now that we have the parquet file, I have no idea how to stream it. PySpark can stream directly to Kafka, but it does so with no regard for the time column. So ... maybe use a database and run a query like `SELECT * FROM dataframe WHERE Stream_Time BETWEEN last_run AND NOW()`, where `last_run` is a variable holding the value of `NOW()` from the previous run, and repeat this query until `last_run > RUNTIME`.

Sounds like a good idea.
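
To make the idea concrete, here is a minimal sketch of that polling loop against a toy in-memory table. The table contents, the `poll_windows` name and the `sleep_s` pause are assumptions for illustration only. It also swaps the inclusive `BETWEEN` for a half-open window, so a row sitting exactly on a boundary is picked up once.

```python
import sqlite3
import time

def poll_windows(cur, runtime, sleep_s=0.01):
    """Yield (window_start, window_end, rows) until window_start exceeds runtime."""
    start = time.time()
    last_run = 0.0
    while last_run <= runtime:
        now = time.time() - start
        # Half-open window [last_run, now): consecutive windows never overlap
        cur.execute(
            "SELECT * FROM stream WHERE Stream_Time >= ? AND Stream_Time < ?",
            (last_run, now),
        )
        yield last_run, now, cur.fetchall()
        last_run = now
        time.sleep(sleep_s)  # hypothetical pause so the loop does not spin

# Toy in-memory table standing in for the real stream table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE stream (ID INTEGER, Stream_Time REAL)")
demo_conn.executemany(
    "INSERT INTO stream VALUES (?, ?)",
    [(i, i * 0.02) for i in range(10)],  # stream times from 0.00 to 0.18
)

sent = [row[0]
        for _, _, rows in poll_windows(demo_conn.cursor(), runtime=0.3)
        for row in rows]
print(sent)
```

Every row should show up exactly once, because each window starts where the previous one ended.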

How should we get that parquet file into a database? Well, PySpark can write through JDBC (if you don't know what that is, think of it as a library that lets Java applications connect to databases). However, sqlite3 doesn't work well with PySpark; it has a JDBC driver, but the experience is still not great.

Pandas has a way to write directly to sqlite3 database, let's try that and see if the performance is acceptable.

Parquet preserves datatypes, but let's double check.

Now, to write to our `sqlite3` database with an explicit schema, we need a `sqlalchemy` engine object, because we will use `sqlalchemy.types` to define the schema. For querying the database, however, I much prefer the standard `sqlite3` driver.

In [1]:
from sqlalchemy import create_engine
conn = create_engine('sqlite:///./Data/stream_df.db')

We will need to pass our schema to the `to_sql` method. However, the `schema` parameter means something else in `to_sql`; the 'schema' we are looking for is passed through the `dtype` parameter.

Of course I didn't write that by hand, I used some ✨ vim magic ✨ (It's regex magic, but I used vim for it). 

In [2]:
import sqlalchemy.types as T
dtypes = {'Event':T.String(),
            'Stream_Time':T.Float(asdecimal=True),
            'ID':T.Integer(),
            'Severity':T.SmallInteger(),
            'Start_Time':T.DateTime(),
            'End_Time':T.DateTime(),
            'Start_Lat':T.Float(asdecimal=True),
            'Start_Lng':T.Float(asdecimal=True),
            'End_Lat':T.Float(asdecimal=True),
            'End_Lng':T.Float(asdecimal=True),
            'Distance_mi':T.Float(asdecimal=True),
            'Description':T.String(),
            'Number':T.Integer,
            'Street':T.String(),
            'Side':T.String(),
            'City':T.String(),
            'County':T.String(),
            'State':T.String(),
            'Zipcode':T.String(),
            'Country':T.String(),
            'Timezone':T.String(),
            'Airport_Code':T.String(),
            'Weather_Timestamp':T.DateTime(),
            'Temperature_F':T.Float(asdecimal=True),
            'Wind_Chill_F':T.Float(asdecimal=True),
            'Humidity_Percent':T.Float(asdecimal=True),
            'Pressure_in':T.Float(asdecimal=True),
            'Visibility_m':T.Float(asdecimal=True),
            'Wind_Direction':T.String(),
            'Wind_Speed_mph':T.Float(asdecimal=True),
            'Precipitation_in':T.Float(asdecimal=True),
            'Weather_Condition':T.String(),
            'Amenity':T.Boolean,
            'Bump':T.Boolean,
            'Crossing':T.Boolean,
            'Give_Way':T.Boolean,
            'Junction':T.Boolean,
            'No_Exit':T.Boolean,
            'Railway':T.Boolean,
            'Roundabout':T.Boolean,
            'Station':T.Boolean,
            'Stop':T.Boolean,
            'Traffic_Calming':T.Boolean,
            'Traffic_Signal':T.Boolean,
            'Turning_Loop':T.Boolean,
            'Sunrise_Sunset':T.String(),
            'Civil_Twilight':T.String(),
            'Nautical_Twilight':T.String(),
            'Astronomical_Twilight':T.String()
}
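
As an alternative to regex, a mapping like this could probably be derived from the DataFrame's dtypes. The sketch below is only an assumption about how one might do it: `KIND_TO_SQL_TYPE` and `sql_type_names` are made-up names, and it works on NumPy dtype "kind" codes so that it runs without pandas or sqlalchemy installed.

```python
# Map NumPy dtype "kind" codes to the names of sqlalchemy.types classes.
KIND_TO_SQL_TYPE = {
    "b": "Boolean",
    "i": "Integer",
    "u": "Integer",
    "f": "Float",
    "M": "DateTime",
    "O": "String",
}

def sql_type_names(column_kinds):
    """Translate {column: dtype kind} into {column: sqlalchemy type name}."""
    return {col: KIND_TO_SQL_TYPE.get(kind, "String")
            for col, kind in column_kinds.items()}

# Hypothetical subset of the columns, with the kinds pandas would report
kinds = {"Event": "O", "Stream_Time": "f", "ID": "i",
         "Start_Time": "M", "Amenity": "b"}
type_names = sql_type_names(kinds)
print(type_names)
```

With the real DataFrame, `{c: d.kind for c, d in df.dtypes.items()}` would supply the kinds, and `getattr(T, name)()` would turn each name into a type instance. Details such as `SmallInteger` or `asdecimal=True` would still have to be adjusted by hand.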

In [3]:
import pandas as pd
df = pd.read_parquet('./Data/parquet/stream_df.parquet')
%time df.to_sql(name = 'stream', con = conn.engine, if_exists = 'replace', dtype = dtypes, \
                chunksize = 100000, index = False)
conn.dispose()

CPU times: user 3min 32s, sys: 13.2 s, total: 3min 45s
Wall time: 3min 59s


That's way too slow. Let's try dropping the schema and using the sqlite3 connector directly.

In [4]:
import sqlite3 

conn = sqlite3.connect('./Data/stream_df.db')
# Return sqlite3.Row objects (addressable by column name, convertible to dict) instead of plain tuples
conn.row_factory = sqlite3.Row 
cur = conn.cursor()
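
For anyone who hasn't used `sqlite3.Row` before, here is a small sketch on a throwaway in-memory table (the table and its values are made up) showing what the row factory gives us: rows that can be read by name or position and turned into a dict when needed.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.row_factory = sqlite3.Row
demo_conn.execute("CREATE TABLE stream (ID INTEGER, Event TEXT)")
demo_conn.execute("INSERT INTO stream VALUES (1, 'start')")

row = demo_conn.execute("SELECT * FROM stream").fetchone()
print(row["Event"], row[1])  # access by column name or by position
print(row.keys())            # column names
print(dict(row))             # plain dict, e.g. for json.dumps
```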

In [5]:
%time df.to_sql(name = 'stream', con = conn, if_exists = 'replace', chunksize = 100000, \
               index = False)

CPU times: user 1min 32s, sys: 13.6 s, total: 1min 45s
Wall time: 2min 8s


That's a little over half the time!

If that is faster, why even bother with the schema? Well, in databases you should generally define your data types explicitly. Appropriate data types usually save storage space and improve performance: they help the RDBMS choose a good query plan and help you optimize indexes if you have any.

One thing to notice: without an explicit schema, the column types are inferred from the DataFrame, and the boolean columns end up as integers holding 0s and 1s. Therefore, when we read them back, especially with Spark, we will have to declare their types as integers, because Spark expects `True` and `False` values when you say boolean.
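
Here is a small sketch of what that looks like on the plain Python side (the Spark side is not shown). The toy table, the `BOOL_COLUMNS` set and the `restore_bools` helper are assumptions for illustration: booleans go in, integers come out, and we convert them back by hand.

```python
import sqlite3

BOOL_COLUMNS = {"Amenity", "Bump"}  # subset of the boolean columns

demo_conn = sqlite3.connect(":memory:")
demo_conn.row_factory = sqlite3.Row
demo_conn.execute(
    "CREATE TABLE stream (ID INTEGER, Amenity INTEGER, Bump INTEGER)")
# Python bools are stored as the integers 1 and 0
demo_conn.execute("INSERT INTO stream VALUES (?, ?, ?)", (1, True, False))

def restore_bools(row):
    """Turn 0/1 integers back into bool for the known boolean columns."""
    return {k: bool(v) if k in BOOL_COLUMNS and v is not None else v
            for k, v in dict(row).items()}

raw = demo_conn.execute("SELECT * FROM stream").fetchone()
restored = restore_bools(raw)
print(dict(raw))
print(restored)
```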

Let us test the speed of our database.

In [6]:
%timeit cur.execute('SELECT * FROM stream WHERE Stream_Time BETWEEN 0 AND 8')

107 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


That's pretty good.
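
One caveat about that measurement: the `%timeit` call only times `cur.execute`, and as far as I understand most of the work of reading rows happens when they are fetched. The sketch below, on a made-up in-memory table, times both variants so the difference can be checked. The index on `Stream_Time` is a hypothetical addition; the notebook itself does not create one.

```python
import sqlite3
import timeit

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE stream (ID INTEGER, Stream_Time REAL)")
demo_conn.executemany(
    "INSERT INTO stream VALUES (?, ?)",
    [(i, i / 1000) for i in range(100_000)],  # stream times 0.000 .. 99.999
)
# Hypothetical index so the range filter does not have to scan the table
demo_conn.execute("CREATE INDEX idx_stream_time ON stream (Stream_Time)")
demo_cur = demo_conn.cursor()

QUERY = "SELECT * FROM stream WHERE Stream_Time BETWEEN 0 AND 8"

def execute_only():
    demo_cur.execute(QUERY)

def execute_and_fetch():
    return demo_cur.execute(QUERY).fetchall()

print("execute only :", timeit.timeit(execute_only, number=100))
print("with fetchall:", timeit.timeit(execute_and_fetch, number=100))
print("rows in window:", len(execute_and_fetch()))
```

The absolute numbers depend on the machine, so the point is only the gap between the two timings.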

Now let's start a kafka producer and start the stream. The kafka cluster is already up and running. [Kafka](https://kafka.apache.org/quickstart)'s official website provides a quickstart guide on how to start the cluster and create a topic. Those steps will be enough for now.

In [8]:
from kafka import KafkaProducer
from config import SOURCE_TOPIC, SERVER_PORT, RUNTIME
import time
import json

producer = KafkaProducer(bootstrap_servers=[SERVER_PORT],
                         value_serializer = lambda x: json.dumps(x, indent = 4).encode('utf-8'))

try:
    start_time = time.time()
    last_run = 0
    while last_run <= RUNTIME:
        now = time.time() - start_time
        cur.execute('SELECT * FROM stream WHERE Stream_Time BETWEEN ? AND ?',
                    (last_run, now))
        last_run = now

        # sqlite3 returns sqlite3.Row objects, need to call dict() on each object
        for row in cur.fetchall():
            producer.send(topic = SOURCE_TOPIC, value = dict(row))
except KeyboardInterrupt:
    pass
finally:
    # send() is asynchronous: flush buffered records, then release resources
    # whether the loop finished on its own or was interrupted
    producer.flush()
    cur.close()
    conn.close()

Our Kafka console consumer received the data. Our job here is done and we can safely Ctrl-C.

We sent `value` without a `key` because keys only serve to route events with the same key to the same partition, which preserves their order (useful for keeping state). However, we don't really care about the order of the events.
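
To illustrate what a key would buy us, here is a toy partitioner. It is not Kafka's implementation: Kafka's default partitioner hashes the key with murmur2, and `zlib.crc32` is only a stand-in to keep the sketch dependency-free. The keys and the partition count are made up.

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Toy partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key) % num_partitions

# Hypothetical events keyed by an accident ID
events = [(b"A-1", "start"), (b"A-2", "start"), (b"A-1", "end")]
for key, event in events:
    print(key, event, "-> partition", pick_partition(key, num_partitions=3))
```

Both `A-1` events land in the same partition, so a consumer sees them in the order they were sent. Without a key the records are simply spread across partitions, which is fine for our purposes.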

The next step is writing a Kafka Streams job to split the data into two other topics: one that holds the beginning of each event and one that holds its end.

This concludes the streaming/ingestion layer. We will finalize the ingestion with Kafka Streams in the next notebook.