Now that the data is written to a Kafka topic, we need to do some processing. We could write a Spark job for this, but since the source and sink are both Kafka topics, Kafka Streams is a better fit.

This is where a problem arises. Kafka Streams is a Java library, and I'm not as fluent in Java as I wish to be. Therefore, we will use Faust which, per the Faust documentation, "is a stream processing library, porting the ideas from Kafka Streams to Python."

Faust has to be run from a terminal, so this notebook will not contain any output. The code in this notebook will be put into a .py file and executed from the terminal.

One thing that we "want" is the schema of our topic, and I say "want" because it is not required. To get it we can use some vim magic (actually it's regex magic, just done in vim). I'll include a quick guide to getting the schema easily.

First we print the schema by calling `df.printSchema()` on the PySpark DataFrame from the `prepare_stream` or `stream` scripts. Then we copy the output and paste it into vim using `Ctrl+Shift+v`.

Then we clear the `|--` before each column by running `:%s/ |-- /`

Next we remove the `(nullable = true)` by running `:%s/(n.*)/`

Change 'string' to 'str' with `:%s/string/str`

And change 'integer' to 'int' with `:%s/integer/int`

Clean the units in the column names with `:%s/(/_` and `:%s/)/`, and remove the '\_%' from the humidity column name.

Then copy all the result using your mouse (just easier).
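If you would rather not do this by hand, the same clean-up can be scripted. The sketch below is a rough Python equivalent of the vim substitutions above, assuming the text looks like typical `df.printSchema()` output. The function name `schema_to_fields`, the sample columns, and the extra `double` to `float` mapping are my own additions for illustration.

```python
import re

def schema_to_fields(schema_text: str) -> list[str]:
    """Turn printSchema() lines into 'name: type' annotation lines."""
    # Assumed mapping; only string and integer come from the steps above
    type_map = {"string": "str", "integer": "int", "double": "float"}
    fields = []
    for line in schema_text.splitlines():
        # Match lines such as " |-- Distance(mi): string (nullable = true)"
        match = re.match(r"\s*\|-- (.+?): (\w+) \(nullable = \w+\)", line)
        if match is None:
            continue  # skip the "root" header and blank lines
        name, spark_type = match.groups()
        name = name.replace("(", "_").replace(")", "")  # Distance(mi) -> Distance_mi
        name = name.removesuffix("_%")                  # Humidity_% -> Humidity
        fields.append(f"{name}: {type_map.get(spark_type, spark_type)}")
    return fields

sample = """root
 |-- ID: integer (nullable = true)
 |-- Distance(mi): string (nullable = true)
 |-- Humidity(%): string (nullable = true)
"""
print("\n".join(schema_to_fields(sample)))
```

The printed lines can be pasted straight into the body of the record class. Note that `str.removesuffix` and `list[str]` need Python 3.9 or newer.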

In [2]:
# Note that the original faust package is no longer actively maintained.
# There is a community-maintained fork called faust-streaming.
!pip install faust-streaming
import faust
from config import SOURCE_TOPIC, SERVER_PORT, START_EVENTS_TOPIC, END_EVENTS_TOPIC

In [6]:
class row(faust.Record, validation = True):
    Start: str 
    Stream_Time: float 
    ID: int 
    Severity: str 
    Start_Time: str 
    End_Time: str 
    Start_Lat: str 
    Start_Lng: str 
    End_Lat: str 
    End_Lng: str 
    Distance_mi: str 
    Description: str 
    Number: str 
    Street: str 
    Side: str 
    City: str 
    County: str 
    State: str 
    Zipcode: str 
    Country: str 
    Timezone: str 
    Airport_Code: str 
    Weather_Timestamp: str 
    Temperature_F: str 
    Wind_Chill_F: str 
    Humidity: str 
    Pressure_in: str 
    Visibility_mi: str 
    Wind_Direction: str 
    Wind_Speed_mph: str 
    Precipitation_in: str 
    Weather_Condition: str 
    Amenity: str 
    Bump: str 
    Crossing: str 
    Give_Way: str 
    Junction: str 
    No_Exit: str 
    Railway: str 
    Roundabout: str 
    Station: str 
    Stop: str 
    Traffic_Calming: str 
    Traffic_Signal: str 
    Turning_Loop: str 
    Sunrise_Sunset: str 
    Civil_Twilight: str 
    Nautical_Twilight: str 
    Astronomical_Twilight: str 

In the final version, we will move this to a separate module.

The class above is the schema of each accident, but remember that the producer to `SOURCE_TOPIC` sends a list of accidents. Therefore, we need to pass `value_type = List[row]` to the source topic. Unfortunately, I don't know how to do this properly, so we will skip this part for now.
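To make the shape of the data concrete, here is a minimal sketch of what one message looks like once decoded. It assumes the producer serializes each batch as a JSON list of objects, which is an assumption on my part, and it uses a plain dataclass named `Accident` with only three fields as a hypothetical stand-in for the full record class, so that it runs without Faust or a broker.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Accident:
    # Hypothetical three-field stand-in for the full record class
    Start: str
    ID: int
    Severity: str

def decode_batch(raw: bytes) -> list[Accident]:
    """Decode one message holding a JSON list of accidents (assumed format)."""
    wanted = [f.name for f in fields(Accident)]
    # Keep only the fields the stand-in declares and ignore any others
    return [
        Accident(**{name: item[name] for name in wanted})
        for item in json.loads(raw)
    ]

raw_message = (
    b'[{"Start": "Start", "ID": 1, "Severity": "2"},'
    b' {"Start": "End", "ID": 1, "Severity": "2"}]'
)
batch = decode_batch(raw_message)
print(len(batch), batch[0].Start, batch[1].Start)
```

The point is only that one message carries many accidents, so whatever consumes the source topic has to loop over the batch before it can look at individual events.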

In [4]:
app = faust.App('Divider', broker=f"kafka://{SERVER_PORT}")
# SOURCE_TOPIC is the name imported from config; each message holds a list of rows
source_topic = app.topic(SOURCE_TOPIC, value_type=list[row])
start_topic = app.topic(START_EVENTS_TOPIC, value_type=row)
end_topic = app.topic(END_EVENTS_TOPIC, value_type=row)

In [8]:
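@app.agent(source_topic)
async def end_reading(records):
    # Each message on the source topic holds a list of accident events
    async for record in records:
        # The batch itself is a plain list, so a regular for loop is enough
        for event in record:
            # Route on the 'Start' marker: start events go to one topic, the rest to the other
            if event['Start'] == 'Start':
                await start_topic.send(value=event)
            else:
                await end_topic.send(value=event)

if __name__ == '__main__':
    app.main()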

You will have to take my word for it when I tell you that this works. But I promise it does.
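If you would rather not take it on faith, the routing rule at least can be checked without a broker. The sketch below pulls the if/else out of the agent into a plain function, here called `split_events` (a name I made up), and runs it on a hand-written batch. It assumes events are dict-like and carry a `Start` field equal to `'Start'` for start events, as in the agent above. It does not exercise Faust or Kafka at all.

```python
def split_events(batch: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split one batch into (start events, end events) using the 'Start' marker."""
    starts, ends = [], []
    for event in batch:
        if event["Start"] == "Start":
            starts.append(event)
        else:
            ends.append(event)
    return starts, ends

# Hand-written sample batch; the IDs are illustrative only
sample_batch = [
    {"Start": "Start", "ID": 1},
    {"Start": "End", "ID": 1},
    {"Start": "Start", "ID": 2},
]
starts, ends = split_events(sample_batch)
print([e["ID"] for e in starts], [e["ID"] for e in ends])
```

In the agent, each element of `starts` would be sent to the start topic and each element of `ends` to the end topic.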

This is the ingestion layer done.