San Francisco Fire Calls
This notebook is the end-to-end example from Chapter 3, showing how to use DataFrame and Spark SQL for common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.

In [6]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

sf_fire_file = "sf-fire-calls.csv"

Define the schema explicitly instead of inferring it: the dataset is described as having 4 million records, and inferring the schema is expensive for large files.

In [7]:
fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
                     StructField('UnitID', StringType(), True),
                     StructField('IncidentNumber', IntegerType(), True),
                     StructField('CallType', StringType(), True),                  
                     StructField('CallDate', StringType(), True),      
                     StructField('WatchDate', StringType(), True),
                     StructField('CallFinalDisposition', StringType(), True),
                     StructField('AvailableDtTm', StringType(), True),
                     StructField('Address', StringType(), True),       
                     StructField('City', StringType(), True),       
                     StructField('Zipcode', IntegerType(), True),       
                     StructField('Battalion', StringType(), True),                 
                     StructField('StationArea', StringType(), True),       
                     StructField('Box', StringType(), True),       
                     StructField('OriginalPriority', StringType(), True),       
                     StructField('Priority', StringType(), True),       
                     StructField('FinalPriority', IntegerType(), True),       
                     StructField('ALSUnit', BooleanType(), True),       
                     StructField('CallTypeGroup', StringType(), True),
                     StructField('NumAlarms', IntegerType(), True),
                     StructField('UnitType', StringType(), True),
                     StructField('UnitSequenceInCallDispatch', IntegerType(), True),
                     StructField('FirePreventionDistrict', StringType(), True),
                     StructField('SupervisorDistrict', StringType(), True),
                     StructField('Neighborhood', StringType(), True),
                     StructField('Location', StringType(), True),
                     StructField('RowID', StringType(), True),
                     StructField('Delay', FloatType(), True)])

In [8]:
# Create a SparkSession
spark = (SparkSession
    .builder
    .appName("Example-3_6")    
    .getOrCreate())

In [9]:
fire_df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)

In [10]:
fire_df.cache()

DataFrame[CallNumber: int, UnitID: string, IncidentNumber: int, CallType: string, CallDate: string, WatchDate: string, CallFinalDisposition: string, AvailableDtTm: string, Address: string, City: string, Zipcode: int, Battalion: string, StationArea: string, Box: string, OriginalPriority: string, Priority: string, FinalPriority: int, ALSUnit: boolean, CallTypeGroup: string, NumAlarms: int, UnitType: string, UnitSequenceInCallDispatch: int, FirePreventionDistrict: string, SupervisorDistrict: string, Neighborhood: string, Location: string, RowID: string, Delay: float]

In [11]:
fire_df.count()

175296

In [12]:
fire_df.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
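 |-- FirePreventionDistrict: string (nullable = true)
 |-- SupervisorDistrict: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- RowID: string (nullable = true)
 |-- Delay: float (nullable = true)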

In [1]:
fire_df.take(5)

Filter out "Medical Incident" call types

Note that where() is an alias for filter() on a DataFrame, so the two are interchangeable. Both accept either a Column expression or a SQL expression string; check the relevant documentation for details.

In [14]:
few_fire_df = (fire_df.select("IncidentNumber", "AvailableDtTm", "CallType") 
              .where(col("CallType") != "Medical Incident"))

few_fire_df.show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



Q-1) How many distinct types of calls were made to the Fire Department?

To be safe, let's exclude null values in that column from the count.

In [15]:
fire_df.select("CallType").where(col("CallType").isNotNull()).distinct().count()

30

Q-2) What distinct types of calls were made to the Fire Department?

These are the distinct types of calls made to the SF Fire Department (the first 10 are shown):

In [16]:
fire_df.select("CallType").where(col("CallType").isNotNull()).distinct().show(10, False)

+-----------------------------------+
|CallType                           |
+-----------------------------------+
|Elevator / Escalator Rescue        |
|Marine Fire                        |
|Aircraft Emergency                 |
|Confined Space / Structure Collapse|
|Administrative                     |
|Alarms                             |
|Odor (Strange / Unknown)           |
|Citizen Assist / Service Call      |
|HazMat                             |
|Watercraft in Distress             |
+-----------------------------------+
only showing top 10 rows



Q-3) Which calls had a response delay greater than 5 mins?

* Rename the column Delay -> ResponseDelayedinMins
* withColumnRenamed() returns a new DataFrame
* Find all calls where the response to the fire site was delayed by more than 5 mins

In [17]:
new_fire_df = fire_df.withColumnRenamed("Delay", "ResponseDelayedinMins")
new_fire_df.select("ResponseDelayedinMins").where(col("ResponseDelayedinMins") > 5).show(5, False)

+---------------------+
|ResponseDelayedinMins|
+---------------------+
|5.35                 |
|6.25                 |
|5.2                  |
|5.6                  |
|7.25                 |
+---------------------+
only showing top 5 rows



Let's do some ETL:

* Transform the string dates to the Spark Timestamp data type so we can make some time-based queries later
* Each transformation returns a new DataFrame
* Cache the new DataFrame

In [18]:
fire_ts_df = (new_fire_df
              .withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy")).drop("CallDate") 
              .withColumn("OnWatchDate",   to_timestamp(col("WatchDate"), "MM/dd/yyyy")).drop("WatchDate")
              .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a")).drop("AvailableDtTm"))     
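
The format strings passed to to_timestamp() can be sanity-checked outside Spark. The sketch below uses Python's datetime.strptime with what I take to be the equivalent directives; the mapping in the comments is my own reading of the two pattern syntaxes, and the sample strings follow the shape of the AvailableDtTm values shown earlier.

```python
from datetime import datetime

# Spark pattern "MM/dd/yyyy"            ~ strptime "%m/%d/%Y"
# Spark pattern "MM/dd/yyyy hh:mm:ss a" ~ strptime "%m/%d/%Y %I:%M:%S %p"
#   hh / %I : hour on a 12-hour clock
#   a  / %p : AM/PM marker
call_date = datetime.strptime("01/11/2002", "%m/%d/%Y")
available = datetime.strptime("01/11/2002 01:51:44 AM", "%m/%d/%Y %I:%M:%S %p")
afternoon = datetime.strptime("01/11/2002 01:51:44 PM", "%m/%d/%Y %I:%M:%S %p")

print(call_date, available, afternoon, sep="\n")
```

The date-only pattern yields midnight timestamps, which matches the 00:00:00 times in the IncidentDate and OnWatchDate columns.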

In [19]:
fire_ts_df.cache()
fire_ts_df.columns

['CallNumber',
 'UnitID',
 'IncidentNumber',
 'CallType',
 'CallFinalDisposition',
 'Address',
 'City',
 'Zipcode',
 'Battalion',
 'StationArea',
 'Box',
 'OriginalPriority',
 'Priority',
 'FinalPriority',
 'ALSUnit',
 'CallTypeGroup',
 'NumAlarms',
 'UnitType',
 'UnitSequenceInCallDispatch',
 'FirePreventionDistrict',
 'SupervisorDistrict',
 'Neighborhood',
 'Location',
 'RowID',
 'ResponseDelayedinMins',
 'IncidentDate',
 'OnWatchDate',
 'AvailableDtTS']

Check the transformed columns with Spark Timestamp type

In [20]:
fire_ts_df.select("IncidentDate", "OnWatchDate", "AvailableDtTS").show(5, False)

+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTS      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
+-------------------+-------------------+-------------------+
only showing top 5 rows



Q-4) What were the most common call types?

List them in descending order

In [21]:
(fire_ts_df
 .select("CallType").where(col("CallType").isNotNull())
 .groupBy("CallType")
 .count()
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))

+-------------------------------+------+
|CallType                       |count |
+-------------------------------+------+
|Medical Incident               |113794|
|Structure Fire                 |23319 |
|Alarms                         |19406 |
|Traffic Collision              |7013  |
|Citizen Assist / Service Call  |2524  |
|Other                          |2166  |
|Outside Fire                   |2094  |
|Vehicle Fire                   |854   |
|Gas Leak (Natural and LP Gases)|764   |
|Water Rescue                   |755   |
+-------------------------------+------+
only showing top 10 rows



Q-4a) What zip codes accounted for most common calls?

Let's investigate which zip codes in San Francisco accounted for the most fire calls and what types of calls they were.

* Filter out rows with a null CallType
* Group them by CallType and Zip code
* Count them and display them in descending order

It seems that the most common calls were all related to Medical Incident, and the top two zip codes are 94102 and 94103.

In [22]:
(fire_ts_df
 .select("CallType", "Zipcode")
 .where(col("CallType").isNotNull())
 .groupBy("CallType", "Zipcode")
 .count()
 .orderBy("count", ascending=False)
 .show(10, truncate=False))

+----------------+-------+-----+
|CallType        |Zipcode|count|
+----------------+-------+-----+
|Medical Incident|94102  |16130|
|Medical Incident|94103  |14775|
|Medical Incident|94110  |9995 |
|Medical Incident|94109  |9479 |
|Medical Incident|94124  |5885 |
|Medical Incident|94112  |5630 |
|Medical Incident|94115  |4785 |
|Medical Incident|94122  |4323 |
|Medical Incident|94107  |4284 |
|Medical Incident|94133  |3977 |
+----------------+-------+-----+
only showing top 10 rows



Q-4b) What San Francisco neighborhoods are in the zip codes 94102 and 94103?

Let's find out the neighborhoods associated with these two zip codes. In all likelihood, these are some of the contested neighborhoods with high reported crime.

In [23]:
fire_ts_df.select("Neighborhood", "Zipcode").where((col("Zipcode") == 94102) | (col("Zipcode") == 94103)).distinct().show(10, truncate=False)

+------------------------------+-------+
|Neighborhood                  |Zipcode|
+------------------------------+-------+
|Potrero Hill                  |94103  |
|Western Addition              |94102  |
|Tenderloin                    |94102  |
|Nob Hill                      |94102  |
|Castro/Upper Market           |94103  |
|South of Market               |94102  |
|South of Market               |94103  |
|Hayes Valley                  |94103  |
|Financial District/South Beach|94102  |
|Mission Bay                   |94103  |
+------------------------------+-------+
only showing top 10 rows



Q-5) What were the total number of alarms and the average, min, and max response times for calls?

Let's use the built-in Spark SQL functions to compute the sum, avg, min, and max of a few columns:

* The total number of alarms
* The min and max delay in response time before the Fire Dept arrived at the scene of the call

In [24]:
# List the distinct years present in the data, in ascending order
fire_ts_df.select(year('IncidentDate')).distinct().orderBy(year('IncidentDate')).show()

+------------------+
|year(IncidentDate)|
+------------------+
|              2000|
|              2001|
|              2002|
|              2003|
|              2004|
|              2005|
|              2006|
|              2007|
|              2008|
|              2009|
|              2010|
|              2011|
|              2012|
|              2013|
|              2014|
|              2015|
|              2016|
|              2017|
|              2018|
+------------------+



Q-6b) What week of the year in 2018 had the most fire calls?

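Note: Week 1 is New Year's week, and week 25 (June 18 to 24 in 2018) precedes the July 4 holiday, which falls in week 27. Fireworks around these holidays may contribute to the higher number of calls, although week 22 tops the list below.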

In [25]:
fire_ts_df.filter(year('IncidentDate') == 2018).groupBy(weekofyear('IncidentDate')).count().orderBy('count', ascending=False).show()

+------------------------+-----+
|weekofyear(IncidentDate)|count|
+------------------------+-----+
|                      22|  259|
|                      40|  255|
|                      43|  250|
|                      25|  249|
|                       1|  246|
|                      44|  244|
|                      13|  243|
|                      32|  243|
|                      11|  240|
|                       5|  236|
|                      18|  236|
|                      23|  235|
|                      31|  234|
|                       2|  234|
|                      42|  234|
|                      19|  233|
|                      10|  232|
|                      34|  232|
|                       8|  232|
|                      21|  231|
+------------------------+-----+
only showing top 20 rows
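
To relate the week numbers in this output to calendar dates, here is a small standard-library sketch. It assumes weekofyear() follows ISO 8601 numbering (weeks start on Monday, and week 1 is the first week with more than three days in the year), which is how the Spark documentation describes it. The helper iso_week_range is a hypothetical name introduced for illustration.

```python
from datetime import date

def iso_week_range(year, week):
    """Return the (Monday, Sunday) dates of the given ISO 8601 week."""
    monday = date.fromisocalendar(year, week, 1)
    sunday = date.fromisocalendar(year, week, 7)
    return monday, sunday

# ISO week number of July 4, 2018
week_of_july_4 = date(2018, 7, 4).isocalendar()[1]

# Calendar ranges for two of the busiest weeks in the output above
week_22 = iso_week_range(2018, 22)
week_25 = iso_week_range(2018, 25)

print(week_of_july_4, week_22, week_25)
```

Under this numbering, July 4, 2018 falls in week 27, week 25 covers June 18 to 24, and week 22 starts on May 28.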



Q-7) What neighborhoods in San Francisco had the worst response time in 2018?

In the sample rows below, a call in Presidio Heights had the Fire Dept arriving in less than 3 mins, while a call in Mission Bay took more than 6 mins. Note that these are individual calls, not per-neighborhood averages.

In [26]:
fire_ts_df.select("Neighborhood", "ResponseDelayedinMins").filter(year("IncidentDate") == 2018).show(10, False)

+------------------------------+---------------------+
|Neighborhood                  |ResponseDelayedinMins|
+------------------------------+---------------------+
|Presidio Heights              |2.8833334            |
|Mission Bay                   |6.3333335            |
|Chinatown                     |2.65                 |
|Financial District/South Beach|3.5333333            |
|Tenderloin                    |1.1                  |
|Bayview Hunters Point         |4.05                 |
|Inner Richmond                |2.5666666            |
|Inner Sunset                  |1.4                  |
|Sunset/Parkside               |2.6666667            |
|South of Market               |1.7666667            |
+------------------------------+---------------------+
only showing top 10 rows



Q-8a) How can we use Parquet files or SQL tables to store data and read it back?

In [32]:
fire_ts_df.write.format("parquet").mode("overwrite").save("/tmp/fireServiceParquet/")

Q-8b) How can we use a Parquet-backed SQL table to store data and read it back?

In [29]:
fire_ts_df.write.format("parquet").mode("overwrite").saveAsTable("FireServiceCallsPy")

Q-8c) How can we read data from a Parquet file?

Note that we don't have to specify the schema here, since it's stored as part of the Parquet metadata.

In [33]:
file_parquet_df = spark.read.format("parquet").load("/tmp/fireServiceParquet/")

In [34]:
display(file_parquet_df.limit(10))

DataFrame[CallNumber: int, UnitID: string, IncidentNumber: int, CallType: string, CallFinalDisposition: string, Address: string, City: string, Zipcode: int, Battalion: string, StationArea: string, Box: string, OriginalPriority: string, Priority: string, FinalPriority: int, ALSUnit: boolean, CallTypeGroup: string, NumAlarms: int, UnitType: string, UnitSequenceInCallDispatch: int, FirePreventionDistrict: string, SupervisorDistrict: string, Neighborhood: string, Location: string, RowID: string, ResponseDelayedinMins: float, IncidentDate: timestamp, OnWatchDate: timestamp, AvailableDtTS: timestamp]