In [0]:
spark

## San Francisco Open Data
<li> Adapted from Sameer Farooqui's talk (July 2016)</li>
<li> 1.6 GB of Fire Department Calls for Service data </li>
<li> The data set has been uploaded to Amazon S3 </li>
<li> We have to mount the respective S3 folder as follows: </li>

In [0]:
ACCESS_KEY_ID = "AKIAJBRYNXGHORDHZB4A"
SECRET_ACCESS_KEY = "a0BzE1bSegfydr3%2FGE3LSPM6uIV5A4hOUfpH8aFF"  # '/' is URL-encoded as %2F

bucket = 'databricks-corp-training/sf_open_data/'
mount_folder = '/mnt/sf_open_data'

# Unmount first in case the folder is already mounted; ignore the error if it is not
try:
  dbutils.fs.unmount(mount_folder)
except Exception:
  pass

dbutils.fs.mount("s3a://" + ACCESS_KEY_ID + ":" + SECRET_ACCESS_KEY + "@" + bucket, mount_folder)
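The secret key above contains `%2F`, the URL-encoded form of `/`. Because the credentials are embedded in the `s3a://` URI, reserved characters in the secret have to be percent-encoded. A minimal sketch of building such a URI from a raw (unencoded) secret follows; `build_s3a_uri` and the placeholder values are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote

def build_s3a_uri(access_key_id, secret_access_key, bucket):
    """Return an s3a:// URI with the credentials embedded (hypothetical helper)."""
    # safe='' makes quote() percent-encode '/' as %2F too
    encoded_secret = quote(secret_access_key, safe='')
    return "s3a://" + access_key_id + ":" + encoded_secret + "@" + bucket

# Placeholder credentials, for illustration only
uri = build_s3a_uri("EXAMPLEKEYID", "abc/def+ghi", "my-bucket/some_folder/")
print(uri)
```

The resulting string could be passed as the first argument of `dbutils.fs.mount`, in the same way as the concatenated URI in the cell above.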

In [0]:
%fs ls /mnt/sf_open_data/fire_dept_calls_for_service/

In [0]:
%fs ls /mnt/sf_open_data/

In [0]:
%fs head /mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv

Load the CSV file into a DataFrame, letting Spark infer the schema (note how long this takes):

In [0]:
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
#fireServiceCallsDF.cache()

In [0]:
fireServiceCallsDF.printSchema()

Inferring the schema is time-consuming for big files because Spark has to scan the data to work out each column's type. Supplying an explicit, pre-defined schema avoids that cost:

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType

In [0]:
fireSchema = StructType([StructField('CallNumber', IntegerType(), True),
                     StructField('UnitID', StringType(), True),
                     StructField('IncidentNumber', IntegerType(), True),
                     StructField('CallType', StringType(), True),                  
                     StructField('CallDate', StringType(), True),       
                     StructField('WatchDate', StringType(), True),       
                     StructField('ReceivedDtTm', StringType(), True),       
                     StructField('EntryDtTm', StringType(), True),       
                     StructField('DispatchDtTm', StringType(), True),       
                     StructField('ResponseDtTm', StringType(), True),       
                     StructField('OnSceneDtTm', StringType(), True),       
                     StructField('TransportDtTm', StringType(), True),                  
                     StructField('HospitalDtTm', StringType(), True),       
                     StructField('CallFinalDisposition', StringType(), True),       
                     StructField('AvailableDtTm', StringType(), True),       
                     StructField('Address', StringType(), True),       
                     StructField('City', StringType(), True),       
                     StructField('ZipcodeofIncident', IntegerType(), True),       
                     StructField('Battalion', StringType(), True),                 
                     StructField('StationArea', StringType(), True),       
                     StructField('Box', StringType(), True),       
                     StructField('OriginalPriority', StringType(), True),       
                     StructField('Priority', StringType(), True),       
                     StructField('FinalPriority', IntegerType(), True),       
                     StructField('ALSUnit', BooleanType(), True),       
                     StructField('CallTypeGroup', StringType(), True),
                     StructField('NumberofAlarms', IntegerType(), True),
                     StructField('UnitType', StringType(), True),
                     StructField('Unitsequenceincalldispatch', IntegerType(), True),
                     StructField('FirePreventionDistrict', StringType(), True),
                     StructField('SupervisorDistrict', StringType(), True),
                     StructField('NeighborhoodDistrict', StringType(), True),
                     StructField('Location', StringType(), True),
                     StructField('RowID', StringType(), True)])
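As an alternative to spelling out every `StructField`, the `schema` argument of `spark.read.csv` also accepts a DDL-formatted string such as `col0 INT, col1 STRING`. The sketch below builds such a string for a subset of the columns above; the `to_ddl` helper is hypothetical, and only the string construction is executed here.

```python
# (name, Spark SQL type) pairs for a subset of the columns in fireSchema
columns = [
    ("CallNumber", "INT"),
    ("UnitID", "STRING"),
    ("IncidentNumber", "INT"),
    ("CallType", "STRING"),
    ("ALSUnit", "BOOLEAN"),
]

def to_ddl(cols):
    """Join (name, type) pairs into a DDL-formatted schema string (hypothetical helper)."""
    return ", ".join(f"{name} {dtype}" for name, dtype in cols)

schema_ddl = to_ddl(columns)
print(schema_ddl)

# In a Spark session the string could then be used like this (not run here):
# spark.read.csv(path, header=True, schema=schema_ddl)
```

A DDL string is shorter to write, while the `StructType` form makes the nullability of each field explicit.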

In [0]:
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, schema=fireSchema)

In [0]:
fireServiceCallsDF.printSchema()

Cache the DataFrame in memory for faster access. The data will be cached lazily.

In [0]:
fireServiceCallsDF.cache()

Now we count the rows in the DataFrame and note how long it takes. Because caching is lazy, this first action also populates the cache.

In [0]:
fireServiceCallsDF.count()

Now we run count() again and note how long it takes when the data is read from the cache

In [0]:
fireServiceCallsDF.count()
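To compare the two runs without relying on the notebook's cell timer, the action can be wrapped in a small timing helper. This is a minimal sketch: `timed` is a hypothetical helper, and a plain Python generator stands in for `fireServiceCallsDF.count` so that the snippet runs without a cluster.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start

# Stand-in for the Spark action; in the notebook one would pass fireServiceCallsDF.count
rows, seconds = timed(lambda: sum(1 for _ in range(1000)))
print(f"{rows} rows counted in {seconds:.4f} s")
```

Calling `timed(fireServiceCallsDF.count)` twice in a row would give the uncached and the cached timings as plain numbers that can be compared directly.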

In [0]:
display(fireServiceCallsDF.limit(5))

## Some queries to get insights from data

Q1: How many different types of calls does the fire department receive?

In [0]:
fireServiceCallsDF.select('CallType').distinct().show(35,False)

Q2: How many incidents of each call type were there?

In [0]:
display(fireServiceCallsDF.select('CallType').groupBy('CallType').count().orderBy("count", ascending=False))
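The `groupBy('CallType').count().orderBy("count", ascending=False)` chain is a frequency count sorted from most to least common. On a small in-memory list, the same idea can be sketched with `collections.Counter`; the sample values below are made up for illustration and are not taken from the data set.

```python
from collections import Counter

# Hypothetical sample of CallType values
sample_call_types = [
    "Medical Incident",
    "Structure Fire",
    "Medical Incident",
    "Alarms",
    "Medical Incident",
    "Alarms",
]

# most_common() returns (value, count) pairs sorted by descending count
counts = Counter(sample_call_types).most_common()
for call_type, count in counts:
    print(call_type, count)
```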

Q3: How many incidents were reported from each neighborhood district?

In [0]:
display(fireServiceCallsDF.select('NeighborhoodDistrict').groupBy('NeighborhoodDistrict').count().orderBy("count", ascending=False))

## Date/Time Analysis

Q4: How many years of Fire Service Calls are in the data file?

In [0]:
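# Import only the functions used below; a wildcard import would shadow Python builtins such as sum, min and max
from pyspark.sql.functions import col, desc, unix_timestamp, year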

In [0]:
from_pattern1 = 'MM/dd/yyyy'              # date-only columns
from_pattern2 = 'MM/dd/yyyy hh:mm:ss aa'  # date-time columns (12-hour clock with AM/PM)

# Source column -> pattern used to parse it
timestamp_columns = {
  'CallDate': from_pattern1,
  'WatchDate': from_pattern1,
  'ReceivedDtTm': from_pattern2,
  'EntryDtTm': from_pattern2,
  'DispatchDtTm': from_pattern2,
  'ResponseDtTm': from_pattern2,
  'OnSceneDtTm': from_pattern2,
  'TransportDtTm': from_pattern2,
  'HospitalDtTm': from_pattern2,
  'AvailableDtTm': from_pattern2,
}

# Replace each string column with a timestamp column named <column>TS
fireServiceCallsTsDF = fireServiceCallsDF
for name, pattern in timestamp_columns.items():
  fireServiceCallsTsDF = fireServiceCallsTsDF \
    .withColumn(name + 'TS', unix_timestamp(fireServiceCallsTsDF[name], pattern).cast("timestamp")) \
    .drop(name)
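To see what the two patterns match, here is a plain-Python counterpart using `datetime.strptime`. The format strings are my translation of the Java-style patterns above into Python's directives, and the sample values are hypothetical strings in the same layout, not rows copied from the file.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"                    # counterpart of 'MM/dd/yyyy'
DATETIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"    # counterpart of 'MM/dd/yyyy hh:mm:ss aa'

# Hypothetical sample values
call_date = datetime.strptime("04/12/2000", DATE_FORMAT)
received = datetime.strptime("04/12/2000 09:00:29 PM", DATETIME_FORMAT)

print(call_date.isoformat())
print(received.isoformat())
```

Note that `%I` (like `hh`) is the 12-hour clock, so the AM/PM marker is needed to recover the hour of the day.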


In [0]:
fireServiceCallsTsDF.printSchema()

In [0]:
# Use the legacy (pre-Spark 3) datetime parser, which accepts the 'aa' AM/PM pattern used above
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
display(fireServiceCallsTsDF.limit(5))


In [0]:
fireServiceCallsTsDF.select(year('CallDateTS')).distinct().orderBy('year(CallDateTS)').show()

Q5: How many service calls were registered in the week before July 4, 2016?

In [0]:
from datetime import datetime
july4 = datetime(2016, 7, 4, 0, 0, 0)
june27 = datetime(2016, 6, 27, 0, 0, 0)
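The two boundary dates are exactly seven days apart, so the lower bound can also be derived from the upper one instead of being typed by hand. A minimal sketch follows; `in_week_before` is a hypothetical helper that treats both ends of the window as inclusive.

```python
from datetime import datetime, timedelta

july4 = datetime(2016, 7, 4)
week_start = july4 - timedelta(days=7)   # 2016-06-27

def in_week_before(ts, end=july4, days=7):
    """True when ts falls in the inclusive window [end - days, end]."""
    return end - timedelta(days=days) <= ts <= end

print(week_start.date())
print(in_week_before(datetime(2016, 7, 1)))
print(in_week_before(datetime(2016, 7, 5)))
```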

In [0]:
from pyspark.sql.functions import to_date

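# Keep only calls between June 27 and July 4 (both inclusive), then count them per day
fireServiceCallsTsDF \
  .where(col('CallDateTS') >= june27) \
  .where(col('CallDateTS') <= july4) \
  .groupBy(to_date('CallDateTS')) \
  .count() \
  .orderBy(to_date('CallDateTS')) \
  .show()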

Now we use Spark SQL through the %sql command, but first we create a temporary view from the DF

In [0]:
fireServiceCallsTsDF.createOrReplaceTempView("firecalls")

In [0]:
%sql select * from firecalls limit 5

## DF joins

Q6: What was the primary non-medical reason most people called the fire department from the Tenderloin last year (2015)? <br>
The "Fire Incidents" data includes a summary of each (non-medical) incident to which the SF Fire Department responded. <br>
Let's join the calls to the Fire Incidents data on the "Incident Number" column.

In [0]:
%sql 
-- IncidentNumber will be the join column with the Fire Incidents CSV file
select CallDateTS, IncidentNumber, CallType
from firecalls
limit 5

In [0]:
incidentsDF = spark.read.csv('/mnt/sf_open_data/fire_incidents/Fire_Incidents.csv', header=True, inferSchema=True).withColumnRenamed('Incident Number', 'IncidentNumber').cache()
incidentsDF.count()

In [0]:
incidentsDF.printSchema()

In [0]:
display(incidentsDF.limit(3))

In [0]:
joinedDF = fireServiceCallsTsDF.join(incidentsDF, fireServiceCallsTsDF.IncidentNumber == incidentsDF.IncidentNumber)

display(joinedDF.limit(3))

In [0]:
joinedDF.count()
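The join above is an inner join: a call row is kept only if its incident number also appears in the incidents data, and it is repeated once per matching incident row. The sketch below shows those semantics on tiny in-memory lists with a dictionary lookup. It does not show how Spark executes the join, and all sample values are made up.

```python
from collections import defaultdict

# Hypothetical sample rows
calls = [
    {"IncidentNumber": 1, "CallType": "Alarms"},
    {"IncidentNumber": 1, "CallType": "Alarms"},
    {"IncidentNumber": 2, "CallType": "Medical Incident"},
    {"IncidentNumber": 3, "CallType": "Structure Fire"},
]
incidents = [
    {"IncidentNumber": 1, "Primary Situation": "False alarm"},
    {"IncidentNumber": 3, "Primary Situation": "Building fire"},
]

# Index the incidents by the join key
incidents_by_number = defaultdict(list)
for incident in incidents:
    incidents_by_number[incident["IncidentNumber"]].append(incident)

# Inner join: calls without a matching incident (number 2 here) are dropped
joined = [
    {**call, **incident}
    for call in calls
    for incident in incidents_by_number.get(call["IncidentNumber"], [])
]
print(len(joined))
```

This is why the row count of the joined data can differ from the row count of either input.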

In [0]:
tenderloinDF = joinedDF.filter(year('CallDateTS') == 2015).filter(col('NeighborhoodDistrict') == 'Tenderloin')
tenderloinDF = tenderloinDF.groupBy('Primary Situation').count()
tenderloinDF = tenderloinDF.orderBy(desc("count"))

display(tenderloinDF.limit(10))

### Most of the calls were False Alarms!

Now the same in Spark SQL. Let's simply create a SQL view on the joinedDF for this:

In [0]:
joinedDF.createOrReplaceTempView("firecall_incidents")

In [0]:
%sql 
SELECT `Primary Situation`, count(*)
FROM firecall_incidents
WHERE year(CallDateTS) = 2015
  AND NeighborhoodDistrict = 'Tenderloin'
GROUP BY `Primary Situation`
ORDER BY COUNT(*) DESC
LIMIT 10

Let's take a look at the number of partitions and then execute the count( ) action.

In [0]:
fireServiceCallsTsDF.rdd.getNumPartitions()

Expand the Spark Jobs and see that 13+1 tasks were launched to execute the count( ). Then, click on the information icon.

In [0]:
fireServiceCallsTsDF.count()
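The "13+1" task pattern presumably corresponds to one task per partition computing a partial count, plus one final task that adds the partial counts together. The following plain-Python sketch imitates that two-stage shape on a list; `split_into_partitions` is a hypothetical helper, and the value 13 is assumed to match the partition count reported above.

```python
def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin over num_partitions lists (hypothetical helper)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

rows = list(range(1000))
partitions = split_into_partitions(rows, 13)

# Stage 1: one partial count per partition (the 13 tasks)
partial_counts = [len(partition) for partition in partitions]

# Stage 2: a single step combines the partial results (the +1 task)
total = sum(partial_counts)
print(len(partial_counts), total)
```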

In [0]:
# Print the parsed, analyzed, optimized and physical plans for the query
spark.sql("""
  SELECT NeighborhoodDistrict, count(NeighborhoodDistrict) AS Neighborhood_Count
  FROM firecalls
  WHERE year(CallDateTS) == 2015
  GROUP BY NeighborhoodDistrict
  ORDER BY Neighborhood_Count DESC
  LIMIT 15
""").explain(True)
