Now that the data is written to a Kafka topic, we need to process it. We could write a Spark job for this, but since the source and sink are both Kafka topics, Kafka Streams is a better fit.

This is where a problem arises. Kafka Streams is a JVM library (Java and Scala), and I'm not as fluent in Java as I wish to be. Therefore, we will use Faust which, per the Faust documentation, "is a stream processing library, porting the ideas from Kafka Streams to Python."

Faust has to be run from a terminal, so this notebook will not contain any output. The code in this notebook will be put into a .py file and executed from the terminal.

One thing that we "want" is the schema of our topic, and I say "want" because it is not required. To get it we can use some vim magic (actually it's regex magic, just done using vim). I'll include a quick guide to getting the schema easily.

First we print the schema by calling `stream_df.printSchema()` on the pyspark df from the `prepare_stream` or `stream` scripts. Then we copy the output and paste it into vim using `Ctrl+Shift+v`.

Then we clear the `|--` before each column by running `:%s/ |-- /`

Next we remove the `(nullable = true)` by running `:%s/(n.*)/`

Change 'string' to 'str' with `:%s/string/str`

And change 'integer' to 'int' with `:%s/integer/int`

Clean the units in the column names with `:%s/(/_` and `:%s/)/`, and change the '%' in the humidity column name to the word 'Percent'.

Then copy all the result using your mouse (just easier).
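
If you would rather not do this by hand, the same cleanup can be sketched in Python. This is only a rough equivalent of the vim steps above, not what the notebook actually uses. The helper name `schema_to_fields` and the `TYPE_MAP` table are hypothetical, the input is assumed to look like standard `printSchema()` output, and the `double` to `float` mapping is an assumption that the vim steps do not cover.

```python
import re

# Spark type names mapped to Python annotations.
# The "double" entry is an assumption; the vim steps only handle string and integer.
TYPE_MAP = {"string": "str", "integer": "int", "double": "float"}


def schema_to_fields(schema_text):
    """Turn printSchema() text into 'name: type' lines for a record class."""
    fields = []
    for line in schema_text.splitlines():
        # Matches lines such as " |-- ID: integer (nullable = true)"
        match = re.match(r"\s*\|-- (.+): (\w+) \(nullable = \w+\)", line)
        if match is None:
            continue  # skips the "root" header and blank lines
        name, spark_type = match.groups()
        name = name.replace("%", "Percent")        # Humidity(%) -> Humidity(Percent)
        name = re.sub(r"\((.*?)\)", r"_\1", name)  # Distance(mi) -> Distance_mi
        fields.append(f"{name}: {TYPE_MAP.get(spark_type, spark_type)}")
    return fields


sample = """root
 |-- Event: string (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Humidity(%): double (nullable = true)
 |-- Distance(mi): string (nullable = true)
"""

for field in schema_to_fields(sample):
    print(field)
```

Types missing from `TYPE_MAP` are passed through unchanged, so anything unexpected is easy to spot in the output.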

In [2]:
# Note: the original faust package is no longer actively maintained.
# faust-streaming is the community-maintained fork.
!pip install faust-streaming
import faust
from config import SOURCE_TOPIC, SERVER_PORT, START_EVENTS_TOPIC, END_EVENTS_TOPIC

In [6]:

# Schema of the records on the source topic; validation=True checks field types
class row(faust.Record, validation=True):
    Event: str 
    Stream_Time: float 
    ID: int 
    Severity: int 
    Start_Time: str 
    End_Time: str 
    Start_Lat: float 
    Start_Lng: float 
    End_Lat: float 
    End_Lng: float 
    Distance_mi: str 
    Description: str 
    Number: int
    Street: str 
    Side: str 
    City: str 
    County: str 
    State: str 
    Zipcode: str 
    Country: str 
    Timezone: str 
    Airport_Code: str 
    Weather_Timestamp: str 
    Temperature_F: float 
    Wind_Chill_F: float 
    Humidity_Percent: float 
    Pressure_in: float
    Visibility_mi: float 
    Wind_Direction: str 
    Wind_Speed_mph: float 
    Precipitation_in: float 
    Weather_Condition: str 
    Amenity: int 
    Bump: int
    Crossing: int 
    Give_Way: int 
    Junction: int 
    No_Exit: int 
    Railway: int 
    Roundabout: int 
    Station: int 
    Stop: int 
    Traffic_Calming: int
    Traffic_Signal: int
    Turning_Loop: int 
    Sunrise_Sunset: str 
    Civil_Twilight: str 
    Nautical_Twilight: str 
    Astronomical_Twilight: str 

In our final versions, we will move this to a separate module. Maybe even create a file with a class that contains the schemas in their different representations.

In [4]:
app = faust.App('Divider', broker=f"kafka://{SERVER_PORT}")
# SOURCE_TOPIC is the name imported from config above
source_topic = app.topic(SOURCE_TOPIC, value_type=row)
start_topic = app.topic(START_EVENTS_TOPIC, value_type=row)
end_topic = app.topic(END_EVENTS_TOPIC, value_type=row)

In [8]:
@app.agent(source_topic)
async def end_reading(events):
    # Route each record to the start or end topic based on its Event field
    async for event in events:
        if event.Event == 'Start':
            await start_topic.send(value=event)
        else:
            await end_topic.send(value=event)

if __name__ == '__main__':
    app.main()

You will have to take my word for it when I tell you that this works. But I promise it does.

This is the ingestion layer done.

You may ask, why did we split the data? Well, there are two reasons for that.

First, I needed an excuse to get into Kafka Streams (or Faust, you know what I'm talking about), and the source topic is still accessible anyway, so no harm done.

Second, in the rare case that we want to hook a data viz tool directly to Kafka (say, a live map of total accidents), we will need to disregard the end events. If we take both the start and end events, we will end up with double the number of accidents. Of course we can filter the records in the viz script/tool directly, but as a personal preference I'd rather do that work in the ingestion layer.
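
To make the double counting concrete, here is a tiny plain-Python illustration. The sample records and the `count_accidents` helper are made up for this example, and the `'End'` value is an assumption (the agent above only checks for `'Start'`).

```python
# Two accidents, each represented by a start record and an end record.
# The 'End' label is assumed; only 'Start' appears in the agent code.
events = [
    {"ID": 1, "Event": "Start"},
    {"ID": 1, "Event": "End"},
    {"ID": 2, "Event": "Start"},
    {"ID": 2, "Event": "End"},
]


def count_accidents(records, start_only):
    """Count records, optionally keeping only the start events."""
    return sum(1 for r in records if not start_only or r["Event"] == "Start")


print(count_accidents(events, start_only=False))  # counts every record
print(count_accidents(events, start_only=True))   # counts each accident once
```

Reading only from the start topic gives the second behavior without any filtering in the viz tool.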

We can also create a KTable to aggregate the accidents over any column. For instance, a KTable holding the number of accidents per state would offload that aggregation from the viz script, but we won't go there now.
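
Just to show the shape of the result such a table would hold, here is a plain-Python stand-in using `collections.Counter`. It is not Faust's or Kafka Streams' table API, only the aggregation logic, and the `accidents_per_state` helper and the sample states are hypothetical.

```python
from collections import Counter


def accidents_per_state(start_events):
    """Running count of accidents keyed by the State column."""
    counts = Counter()
    for event in start_events:
        counts[event["State"]] += 1
    return counts


# Made-up records, as they might arrive on the start events topic.
sample_events = [
    {"ID": 1, "State": "CA"},
    {"ID": 2, "State": "TX"},
    {"ID": 3, "State": "CA"},
]

print(accidents_per_state(sample_events))
```

A real table would keep these counts updated as new events arrive, so the viz tool would only need to read the current value per state.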