For this project, I decided to use Kafka just to practice connecting Spark with Kafka. Obviously, this project doesn't need Kafka since the data is in a CSV file. However, we can simulate a stream by sending each row according to its start and end times.

So, we will make a config.py file containing a variable that decides how long the stream should take. In other words, how long it should take to stream the data to Kafka, from the first start date to the last end date. This variable is in seconds.
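A minimal sketch of what config.py might contain, assuming it holds nothing but this one variable. The value 600 is the one the notebook prints when it imports RUNTIME.

```python
# config.py (sketch; assumes this single variable is all the file holds)

# Total wall-clock duration of the simulated stream, in seconds.
# 600 means the whole dataset is replayed in ten minutes.
RUNTIME = 600
```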

Let's try to read it.

In [1]:
! pip install pyspark
import findspark
findspark.init()
from config import RUNTIME
RUNTIME

Defaulting to user installation because normal site-packages is not writeable


600

Now that this is out of the way, let's scale the time.

From the previous notebook, we know the first start date is '2016-01-14 20:18:33' and the last end date is '2022-01-01 00:00:00'. Let's create a UDF to convert these dates to seconds and scale them to fit inside the RUNTIME window. At each record's scaled start time we stream the entire record (without the end time), and at its scaled end time we stream the end time.

First, the UDF!

In [2]:
from pyspark.sql.functions import udf

Oh wait, first the spark instance ... my bad.

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .config('spark.driver.memory','15g') \
    .appName("USA Accidents Analysis with Pyspark") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/07/25 22:38:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
from datetime import datetime
def toUnix(date):
    return datetime.timestamp(date)

Let's test that and verify using an online tool.

In [5]:
toUnix(datetime.now())

1658781513.591259

That looks correct. However, the datetime.now() method returns a datetime type, not a string.
This means that our input needs to be converted to datetime. Let me redefine the function and I'll be right back.

In [6]:
def toUnix(date):
    # This is to account for milliseconds, we don't care about such precision
    date = date.split('.')[0]
    date = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
    return datetime.timestamp(date)

In [7]:
toUnix('2022-05-21 01:00:00')

1653087600.0
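One caveat: datetime.timestamp() interprets a naive datetime in the machine's local time zone, so the number above depends on where the notebook runs. Here is a minimal sketch of a variant that does not depend on the machine, assuming the timestamps should be read as UTC (the name to_unix_utc is hypothetical, and the dataset's actual time zone is not stated here).

```python
from datetime import datetime, timezone

def to_unix_utc(date: str) -> float:
    """Parse 'YYYY-MM-DD HH:MM:SS[.fraction]' as UTC and return Unix seconds."""
    # Drop any fractional seconds, as the notebook's toUnix does
    date = date.split('.')[0]
    parsed = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
    # Attach UTC explicitly so the result does not depend on the local time zone
    return parsed.replace(tzinfo=timezone.utc).timestamp()

print(to_unix_utc('2022-05-21 01:00:00'))  # 1653094800.0
```

The 7200-second gap from the value above suggests the notebook ran on a machine set to UTC+2. For the scaling later on only differences between timestamps matter, so this mostly shows up around daylight-saving changes.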

Perfect!! Now, on to a UDF.

In [8]:
udfToUnix = udf(toUnix)

In [9]:
main_df = spark.read.csv('./US_Accidents_Dec21_updated.csv')
main_df

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string, _c12: string, _c13: string, _c14: string, _c15: string, _c16: string, _c17: string, _c18: string, _c19: string, _c20: string, _c21: string, _c22: string, _c23: string, _c24: string, _c25: string, _c26: string, _c27: string, _c28: string, _c29: string, _c30: string, _c31: string, _c32: string, _c33: string, _c34: string, _c35: string, _c36: string, _c37: string, _c38: string, _c39: string, _c40: string, _c41: string, _c42: string, _c43: string, _c44: string, _c45: string, _c46: string]

The headers ...

In [10]:
main_df = spark.read.csv('./US_Accidents_Dec21_updated.csv', header = True)
main_df

DataFrame[ID: string, Severity: string, Start_Time: string, End_Time: string, Start_Lat: string, Start_Lng: string, End_Lat: string, End_Lng: string, Distance(mi): string, Description: string, Number: string, Street: string, Side: string, City: string, County: string, State: string, Zipcode: string, Country: string, Timezone: string, Airport_Code: string, Weather_Timestamp: string, Temperature(F): string, Wind_Chill(F): string, Humidity(%): string, Pressure(in): string, Visibility(mi): string, Wind_Direction: string, Wind_Speed(mph): string, Precipitation(in): string, Weather_Condition: string, Amenity: string, Bump: string, Crossing: string, Give_Way: string, Junction: string, No_Exit: string, Railway: string, Roundabout: string, Station: string, Stop: string, Traffic_Calming: string, Traffic_Signal: string, Turning_Loop: string, Sunrise_Sunset: string, Civil_Twilight: string, Nautical_Twilight: string, Astronomical_Twilight: string]

In [11]:
from pyspark.sql.types import FloatType
main_df = main_df.withColumn('Start_Time_Unix', udfToUnix(main_df['Start_Time']) \
                                             .cast(FloatType()))
main_df = main_df.withColumn('End_Time_Unix', udfToUnix(main_df['End_Time'])
                                            .cast(FloatType()))
main_df

DataFrame[ID: string, Severity: string, Start_Time: string, End_Time: string, Start_Lat: string, Start_Lng: string, End_Lat: string, End_Lng: string, Distance(mi): string, Description: string, Number: string, Street: string, Side: string, City: string, County: string, State: string, Zipcode: string, Country: string, Timezone: string, Airport_Code: string, Weather_Timestamp: string, Temperature(F): string, Wind_Chill(F): string, Humidity(%): string, Pressure(in): string, Visibility(mi): string, Wind_Direction: string, Wind_Speed(mph): string, Precipitation(in): string, Weather_Condition: string, Amenity: string, Bump: string, Crossing: string, Give_Way: string, Junction: string, No_Exit: string, Railway: string, Roundabout: string, Station: string, Stop: string, Traffic_Calming: string, Traffic_Signal: string, Turning_Loop: string, Sunrise_Sunset: string, Civil_Twilight: string, Nautical_Twilight: string, Astronomical_Twilight: string, Start_Time_Unix: float, End_Time_Unix: float]

Now we're close! Let's create a table with one row per Unix time and a column that states the ID. Of course, if two timestamps have the same ID, then the earlier one is the start and the later one is the end.
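In plain Python, the idea looks roughly like this. It is only a sketch on made-up records (the tuples and their values are hypothetical), not the Spark code used in this notebook.

```python
# Hypothetical records: (id, start_unix, end_unix)
records = [(1, 100.0, 160.0), (2, 120.0, 130.0)]

# One event per timestamp: each record contributes a start and an end
events = [(rid, start) for rid, start, _ in records] + \
         [(rid, end) for rid, _, end in records]

# Order events by time, as the stream will eventually replay them
events.sort(key=lambda e: e[1])

# For a given id, the earlier timestamp is the start and the later is the end
seen = set()
labeled = []
for rid, ts in events:
    kind = 'end' if rid in seen else 'start'
    seen.add(rid)
    labeled.append((rid, ts, kind))

print(labeled)
```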

But first, we need to clean the ID column! This should be easy.

In [12]:
from pyspark.sql.functions import split
from pyspark.sql.types import IntegerType
main_df = main_df.withColumn('ID', split(main_df['ID'],'-').getItem(1).cast(IntegerType()))
main_df.select('ID').show()

+---+
| ID|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
| 20|
+---+
only showing top 20 rows



In [13]:
temp_df = main_df.select('ID','Start_Time_Unix') \
        .union(main_df.select('ID','End_Time_Unix')) \
        .orderBy('Start_Time_Unix') \
        .orderBy('ID')
temp_df.show()
temp_df.printSchema()



+---+---------------+
| ID|Start_Time_Unix|
+---+---------------+
|  1|   1.45488461E9|
|  1|   1.45490624E9|
|  2|   1.45492544E9|
|  2|   1.45490381E9|
|  3|   1.45492659E9|
|  3|   1.45490496E9|
|  4|   1.45490714E9|
|  4|   1.45492877E9|
|  5|   1.45491085E9|
|  5|   1.45493248E9|
|  6|   1.45493376E9|
|  6|   1.45491226E9|
|  7|   1.45491213E9|
|  7|   1.45493376E9|
|  8|   1.45492506E9|
|  8|   1.45494669E9|
|  9|   1.45495565E9|
|  9|   1.45493402E9|
| 10|   1.45495898E9|
| 10|   1.45493734E9|
+---+---------------+
only showing top 20 rows

root
 |-- ID: integer (nullable = true)
 |-- Start_Time_Unix: float (nullable = true)



                                                                                

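Absolute perfection! Maybe change the column name, but other than that ... absolute perfection!

You may notice that the values in the second column look almost identical. They are printed in scientific notation (E9 means ×10^9), and a FloatType keeps only about seven significant digits, so most of the difference between timestamps is hidden. Also note that the two chained orderBy calls do not combine: the last one wins, so the rows end up sorted by ID only, which is why the two timestamps of an ID can appear in either order. That's fine here, since we sort by the stream time later anyway.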
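To see how much precision is lost when a Unix timestamp is stored as a 32-bit float (which is what FloatType is), here is a small standard-library sketch. The helper to_float32 is hypothetical and not part of PySpark, and 1640988000.0 is just an illustrative input.

```python
import struct

def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack('f', struct.pack('f', value))[0]

# Near 1.6e9, adjacent 32-bit floats are 128 apart,
# so timestamps are only accurate to within about a minute.
print(to_float32(1640988000.0))  # 1640988032.0
print(to_float32(1640988001.0))  # 1640988032.0
```

Casting to DoubleType instead of FloatType would keep the timestamps exact to the second. For a ten-minute replay of several years of data the rounding is probably harmless, but it is worth knowing about.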

So, all this work and we still didn't get the scaling done. The scaling from this point is easy: we subtract the earliest time, divide by the span between the latest and earliest times, and multiply by RUNTIME.
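More precisely, the scaled time is (t - earliest) / (latest - earliest) * RUNTIME, so the earliest event lands at 0 seconds and the latest at RUNTIME seconds. A quick plain-Python check, using the minimum and maximum values the notebook computes further down:

```python
RUNTIME = 600                # seconds, as in config.py
earliest = 1452795520.0      # min Unix time reported by the notebook
latest = 1640988032.0        # max Unix time reported by the notebook

def scale(unix: float) -> float:
    """Map a Unix time in [earliest, latest] onto [0, RUNTIME] seconds."""
    return (unix - earliest) / (latest - earliest) * RUNTIME

print(scale(earliest))                     # 0.0
print(scale(latest))                       # 600.0
print(scale((earliest + latest) / 2))      # 300.0
```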

So ... another UDF? 

In [14]:
temp_df = temp_df.withColumn('Time_Unix',temp_df['Start_Time_Unix'])
earliest = temp_df.agg({'Time_Unix':"min"}).collect()
earliest

                                                                                

[Row(min(Time_Unix)=1452795520.0)]

The value is inside a PySpark Row. Don't worry, we can get it out.
A simple Google search leads us to a few ways to do so; below are two of them.

In [15]:
# earliest = earliest[0]['min(Time_Unix)']  # by field name (matches the Row printed above)
earliest = earliest[0][0]
earliest

1452795520.0

In [16]:
latest = temp_df.agg({"Time_Unix":"max"}).collect()
latest = latest[0][0]
latest

                                                                                

1640988032.0

In [17]:
def scale(unix):
    return ((unix - earliest) / (latest - earliest))*RUNTIME 

# We can specify the return type of the udf, instead of the approach we used before
udfScaling = udf(scale,FloatType())
temp_df = temp_df.withColumn('Stream_Time',udfScaling(temp_df['Time_Unix']))

A job well done!
Just one final step.

In [18]:
temp_df = temp_df.withColumnRenamed('ID','temp_id')
to_delete = ('Start_Time_Unix','End_Time_Unix','Time_Unix',"temp_id")

In [19]:
from pyspark.sql.functions import broadcast
stream_df = temp_df.join(broadcast(main_df), temp_df.temp_id == main_df.ID) \
            .drop(*to_delete)
stream_df = stream_df.orderBy('Stream_Time')

In [20]:
stream_df.printSchema()

root
 |-- Stream_Time: float (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Severity: string (nullable = true)
 |-- Start_Time: string (nullable = true)
 |-- End_Time: string (nullable = true)
 |-- Start_Lat: string (nullable = true)
 |-- Start_Lng: string (nullable = true)
 |-- End_Lat: string (nullable = true)
 |-- End_Lng: string (nullable = true)
 |-- Distance(mi): string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Number: string (nullable = true)
 |-- Street: string (nullable = true)
 |-- Side: string (nullable = true)
 |-- City: string (nullable = true)
 |-- County: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Zipcode: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Timezone: string (nullable = true)
 |-- Airport_Code: string (nullable = true)
 |-- Weather_Timestamp: string (nullable = true)
 |-- Temperature(F): string (nullable = true)
 |-- Wind_Chill(F): string (nullable = true)
 |-- Humidity(%): string

That was a lot of work. There is no way such a table can fit in my machine's memory. So, we will just write it to a parquet file and read from it in the streaming script.

In [21]:
# stream_df.write.parquet('./Stream.parquet')
# spark.catalog.dropTempView("temp")
# spark.catalog.clearCache()
#desc_df = stream_df['Description']
#stream_df = stream_df.drop('Description')

In [23]:
from pyspark import SparkConf
conf = SparkConf()
conf.set('spark.sql.autoBroadcastJoinThreshold','-1')
conf.set('spark.driver.memory','5g') # default is 1024 MB or 1G
targetfolder = './Stream/'
stream_df.coalesce(1).write.parquet(targetfolder)

                                                                                

That previous block may have been confusing. Basically, I needed to write out to a parquet file. Removing the coalesce(1) from the last line would give us a folder with multiple parquet files inside, one file per partition.

To ask PySpark for a single file we need to reduce the number of partitions to one using repartition or coalesce. Both can reduce the number of partitions; however, coalesce is usually cheaper for this because it merges existing partitions without a full shuffle, whereas repartition always shuffles (and can also increase the number of partitions).

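We set the two configs because we ran out of memory when trying to write out the file: setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic broadcast joins, and spark.driver.memory sets the memory available to the driver. Keep in mind that a SparkConf created after the session has started is not attached to it, so for these values to take effect they most likely have to be passed to the SparkSession builder instead.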

Now to rename the output file so that we can read it from the streaming script without those weird names.

In [24]:
import os
os.listdir(targetfolder)

['part-00000-834623a2-24a5-49ec-9c7f-aa3f558bb987-c000.snappy.parquet',
 '.part-00000-834623a2-24a5-49ec-9c7f-aa3f558bb987-c000.snappy.parquet.crc',
 '_SUCCESS',
 '._SUCCESS.crc']

What even is that name, Spark? It's ok, it's ok, we got it.

In [25]:
filename = os.listdir(targetfolder)[0]
os.rename(f'{targetfolder}{filename}', f'{targetfolder}stream_df.parquet')
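os.listdir returns entries in arbitrary order, so taking element [0] only works when the part file happens to come first. Here is a slightly more defensive sketch, assuming the folder holds exactly one part file (find_part_file is a hypothetical helper, and the demo file names are made up):

```python
import os
import tempfile

def find_part_file(folder: str) -> str:
    """Return the name of the single part-*.parquet file Spark wrote."""
    parts = [name for name in os.listdir(folder)
             if name.startswith('part-') and name.endswith('.parquet')]
    if len(parts) != 1:
        raise RuntimeError(f'expected exactly one part file, found {len(parts)}')
    return parts[0]

# Demo on a throwaway folder that mimics the listing above
with tempfile.TemporaryDirectory() as folder:
    for name in ['_SUCCESS', '._SUCCESS.crc',
                 '.part-00000-demo.snappy.parquet.crc',
                 'part-00000-demo.snappy.parquet']:
        open(os.path.join(folder, name), 'w').close()
    filename = find_part_file(folder)
    os.rename(os.path.join(folder, filename),
              os.path.join(folder, 'stream_df.parquet'))
    result = sorted(os.listdir(folder))

print(result)
```

In the notebook itself, the same helper would be called with targetfolder in place of the temporary folder.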

And we're done for this one. Let's now stream that dataframe to a Kafka topic. Woohoo!