## DataFrame Operations

Most Spark jobs start by reading data from somewhere. The [DataFrameReader](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.html) interface lets us load a `DataFrame` from a variety of data sources such as CSV, Parquet, JDBC, and JSON.

We'll use an example from _Chapter 3_ of the book _Learning Spark v2_.

In [0]:
from pyspark.sql.types import IntegerType, StringType, BooleanType, FloatType, StructType, StructField

fire_schema = StructType([
    StructField('CallNumber', IntegerType(), True),
    StructField('UnitID', StringType(), True),
    StructField('IncidentNumber', IntegerType(), True),
    StructField('CallType', StringType(), True),
    StructField('CallDate', StringType(), True),
    StructField('WatchDate', StringType(), True),
    StructField('CallFinalDisposition', StringType(), True),
    StructField('AvailableDtTm', StringType(), True),
    StructField('Address', StringType(), True),
    StructField('City', StringType(), True),
    StructField('Zipcode', IntegerType(), True),
    StructField('Battalion', StringType(), True),
    StructField('StationArea', StringType(), True),
    StructField('Box', StringType(), True),
    StructField('OriginalPriority', StringType(), True),
    StructField('Priority', StringType(), True),
    StructField('FinalPriority', IntegerType(), True),
    StructField('ALSUnit', BooleanType(), True),
    StructField('CallTypeGroup', StringType(), True),
    StructField('NumAlarms', IntegerType(), True),
    StructField('UnitType', StringType(), True),
    StructField('UnitSequenceInCallDispatch', IntegerType(), True),
    StructField('FirePreventionDistrict', StringType(), True),
    StructField('SupervisorDistrict', StringType(), True),
    StructField('Neighborhood', StringType(), True),
    StructField('Location', StringType(), True),
    StructField('RowID', StringType(), True),
    StructField('Delay', FloatType(), True)
])

In [0]:
sf_fire_file = "/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv"
df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)

In [0]:
display(df)

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103,B02,36,2338,1,1,2,True,,1,MEDIC,1,2.0,6,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333
20110015,M08,2003233,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:10:17 AM,300 Block of 5TH ST,SF,94107,B03,8,2243,1,1,2,True,,1,MEDIC,1,3.0,6,South of Market,"(37.7792841462441, -122.402061300134)",020110015-M08,3.0833333
20110016,B02,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,6,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B02,3.05
20110016,B04,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:54 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,3,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B04,2.3166666
20110016,D2,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,4,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-D2,3.0166667
20110016,E03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,7,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E03,2.6833334
20110016,E38,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:17 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,1,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E38,2.1
20110016,E41,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,8,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E41,2.7166667
20110016,M03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:38 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,True,,1,MEDIC,10,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-M03,2.7666667
20110016,RS1,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:57 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,RESCUE SQUAD,9,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-RS1,3.2666667


## Writing data
We can use the [DataFrameWriter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.html) interface to persist data in a variety of formats.

Now, let's filter the `DataFrame` and persist it. You can specify the format that Spark uses to persist via `.format`.

In [0]:
%fs mkdirs output

In [0]:
df.where(df.UnitID == 'M29').write.format("parquet").save('/output/fire_calls_m29.parquet')

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<command-1954381968156679> in <module>
----> 1 df.where(df.UnitID == 'M29').write.format("parquet").save('/output/fire_calls_m29.parquet')

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    738             self._jwrite.save()
    739         else:
--> 740       

In [0]:
%fs ls /output/

path,name,size,modificationTime
dbfs:/output/fire_calls_m29.parquet/,fire_calls_m29.parquet/,0,0


Hmm, something looks strange. Why the trailing `/`? Because `fire_calls_m29.parquet` is actually a folder. Let's look inside it.

In [0]:
%fs ls /output/fire_calls_m29.parquet



Spark has actually split the dataset under the hood into multiple Parquet files. This happens automatically, and most of the time we don't need to care about it. We can control how the dataset is partitioned on disk via the `partitionBy` method, which we'll look at shortly.

The partitioning is also hidden from us when reading: we simply point Spark at the folder.

In [0]:
display(
    spark
        .read
        .parquet('/output/fire_calls_m29.parquet') # this is actually a folder
)
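
To make the folder-of-part-files idea concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: the helper names `write_parts` and `read_parts`, the JSON-lines file format, the round-robin split, and the sample rows are all assumptions made for illustration. It writes a few rows as several `part-*` files inside one directory and then reads them back by pointing at the directory only.

```python
import json
import tempfile
from pathlib import Path


def write_parts(rows, directory, num_parts):
    """Split rows round-robin into num_parts files inside one directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for i in range(num_parts):
        part = directory / f"part-{i:05d}.jsonl"
        with part.open("w") as fh:
            for row in rows[i::num_parts]:
                fh.write(json.dumps(row) + "\n")
    # Empty marker file, similar in spirit to the _SUCCESS file Spark leaves behind.
    (directory / "_SUCCESS").touch()


def read_parts(directory):
    """Read every part file in the directory; marker files are skipped."""
    rows = []
    for part in sorted(Path(directory).glob("part-*")):
        with part.open() as fh:
            rows.extend(json.loads(line) for line in fh)
    return rows


# Made-up sample rows, only loosely shaped like the fire-calls data.
rows = [{"UnitID": "M29", "Delay": d} for d in (5.2, 3.1, 2.7, 4.0, 1.9)]

with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "fire_calls_m29"
    write_parts(rows, dataset, num_parts=3)
    files = sorted(p.name for p in dataset.iterdir())
    restored = read_parts(dataset)

print(files)
print(len(restored))
```

The caller of `read_parts` never names an individual file, which is the same convenience Spark offers when we pass it the folder path.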



Ref: [_PySpark partitionBy() – Write to Disk Example_](https://sparkbyexamples.com/pyspark/pyspark-partitionby-example/)

Spark can split a large dataset into smaller partitions based on one or more partition keys. Transformations and filters on partition-key columns let the Spark application run faster, since the work executes in parallel on each partition.

In [0]:
df.write.format("parquet")\
    .partitionBy("CallType")\
    .mode("overwrite")\
    .save('/output/fire_calls_call_types.parquet')

Note: When writing partitioned data, PySpark removes the partition column from the data files and encodes the column name and value in the folder name instead, which also saves some storage space.

In [0]:
%fs ls /output/fire_calls_call_types.parquet

path,name,size,modificationTime
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/,CallType=Administrative/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Aircraft Emergency/,CallType=Aircraft Emergency/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Alarms/,CallType=Alarms/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Assist Police/,CallType=Assist Police/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Citizen Assist %2F Service Call/,CallType=Citizen Assist %2F Service Call/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Confined Space %2F Structure Collapse/,CallType=Confined Space %2F Structure Collapse/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Electrical Hazard/,CallType=Electrical Hazard/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Elevator %2F Escalator Rescue/,CallType=Elevator %2F Escalator Rescue/,0,0
dbfs:/output/fire_calls_call_types.parquet/CallType=Explosion/,CallType=Explosion/,0,0
"dbfs:/output/fire_calls_call_types.parquet/CallType=Extrication %2F Entrapped (Machinery, Vehicle)/","CallType=Extrication %2F Entrapped (Machinery, Vehicle)/",0,0


In [0]:
%fs ls /output/fire_calls_call_types.parquet/CallType=Administrative/

path,name,size,modificationTime
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/_SUCCESS,_SUCCESS,0,1663562718000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/_committed_5752530849733393249,_committed_5752530849733393249,915,1663562717000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/_started_5752530849733393249,_started_5752530849733393249,0,1663562608000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/part-00000-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-11-1.c000.snappy.parquet,part-00000-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-11-1.c000.snappy.parquet,9260,1663562614000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/part-00001-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-12-1.c000.snappy.parquet,part-00001-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-12-1.c000.snappy.parquet,10243,1663562614000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/part-00002-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-13-1.c000.snappy.parquet,part-00002-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-13-1.c000.snappy.parquet,8205,1663562614000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/part-00003-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-14-1.c000.snappy.parquet,part-00003-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-14-1.c000.snappy.parquet,10512,1663562614000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/part-00004-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-15-1.c000.snappy.parquet,part-00004-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-15-1.c000.snappy.parquet,8270,1663562614000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/part-00005-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-16-1.c000.snappy.parquet,part-00005-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-16-1.c000.snappy.parquet,8949,1663562614000
dbfs:/output/fire_calls_call_types.parquet/CallType=Administrative/part-00006-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-17-1.c000.snappy.parquet,part-00006-tid-5752530849733393249-1682533b-af8a-44d2-b730-4ccd885fd9bc-17-1.c000.snappy.parquet,9300,1663562614000
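
The layout in the two listings above can be mimicked with a small plain-Python sketch. This is only an illustration, not Spark's code: `partition_dir` and `partition_rows` are hypothetical helpers, the sample rows are made up, and the escaping is simplified to the one case visible in the listing (`/` becoming `%2F`), whereas Spark escapes more characters than that.

```python
from collections import defaultdict


def partition_dir(column, value):
    """Build a folder name like the ones in the listing: column=value."""
    # Simplification: only '/' is escaped here; Spark escapes more characters.
    return f"{column}={str(value).replace('/', '%2F')}"


def partition_rows(rows, column):
    """Group rows by folder name, dropping the partition column from each row."""
    layout = defaultdict(list)
    for row in rows:
        remaining = {k: v for k, v in row.items() if k != column}
        layout[partition_dir(column, row[column])].append(remaining)
    return dict(layout)


# Made-up sample rows for illustration.
calls = [
    {"CallType": "Alarms", "UnitID": "E03", "Delay": 2.68},
    {"CallType": "Citizen Assist / Service Call", "UnitID": "M29", "Delay": 5.23},
    {"CallType": "Alarms", "UnitID": "E38", "Delay": 2.1},
]

layout = partition_rows(calls, "CallType")
for folder, stored in layout.items():
    print(folder, stored)
```

Each stored row no longer carries `CallType`; the value can be recovered from the folder name, which is why the column shows up last when the dataset is read back.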


Reading and filtering on a certain column is typically faster when the data is partitioned by that column, because Spark only needs to read the matching folder.

In [0]:
from pyspark.sql.functions import col

display(
    spark
        .read
        .parquet('/output/fire_calls_call_types.parquet')  # Parquet files carry their own schema
        .where(col("CallType") == "Administrative")
)

CallNumber,UnitID,IncidentNumber,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay,CallType
140220102,E08,14007454,01/22/2014,01/22/2014,Other,01/22/2014 10:30:55 AM,2300 Block of FOLSOM ST,SF,94110,B06,07,5499.0,,3,3,True,,1,ENGINE,1,6,9,Mission,"(37.7596933117757, -122.414834950578)",140220102-E08,15.766666,Administrative
140520076,96,14017540,02/21/2014,02/21/2014,Other,02/21/2014 09:23:14 AM,1000 Block of TURK ST,SF,94102,B02,05,3426.0,,3,3,True,,1,MEDIC,1,2,5,Western Addition,"(37.7812721471287, -122.425549328462)",140520076-96,23.766666,Administrative
140520216,88,14017645,02/21/2014,02/21/2014,Other,02/21/2014 04:31:23 PM,1400 Block of EVANS AV,,94124,B99,F3,,,3,3,True,,1,MEDIC,1,10,10,Bayview Hunters Point,"(37.7413301917564, -122.385748602368)",140520216-88,5.0833335,Administrative
141020070,52,14034451,04/12/2014,04/11/2014,Other,04/12/2014 03:11:12 PM,400 Block of CHURCH ST,,94114,B02,06,523.0,,3,3,False,,1,MEDIC,3,2,8,Castro/Upper Market,"(37.7632288973036, -122.4286109134)",141020070-52,129.91667,Administrative
141020070,B01,14034451,04/12/2014,04/11/2014,Other,04/12/2014 02:39:30 PM,400 Block of CHURCH ST,,94114,B02,06,523.0,,3,3,False,,1,CHIEF,8,2,8,Castro/Upper Market,"(37.7632288973036, -122.4286109134)",141020070-B01,313.61667,Administrative
141020070,B02,14034451,04/12/2014,04/11/2014,Other,04/12/2014 01:15:59 PM,400 Block of CHURCH ST,,94114,B02,06,523.0,,3,3,False,,1,CHIEF,7,2,8,Castro/Upper Market,"(37.7632288973036, -122.4286109134)",141020070-B02,214.8,Administrative
141020070,B03,14034451,04/12/2014,04/11/2014,Other,04/12/2014 02:43:32 PM,400 Block of CHURCH ST,,94114,B02,06,523.0,,3,3,False,,1,CHIEF,9,2,8,Castro/Upper Market,"(37.7632288973036, -122.4286109134)",141020070-B03,322.28333,Administrative
141020070,B04,14034451,04/12/2014,04/11/2014,Other,04/12/2014 02:50:12 PM,400 Block of CHURCH ST,,94114,B02,06,523.0,,3,3,False,,1,CHIEF,10,2,8,Castro/Upper Market,"(37.7632288973036, -122.4286109134)",141020070-B04,361.68332,Administrative
141020070,B07,14034451,04/12/2014,04/11/2014,Other,04/12/2014 12:32:49 PM,400 Block of CHURCH ST,,94114,B02,06,523.0,,3,3,False,,1,CHIEF,5,2,8,Castro/Upper Market,"(37.7632288973036, -122.4286109134)",141020070-B07,145.63333,Administrative
141020070,B10,14034451,04/12/2014,04/11/2014,Other,04/12/2014 12:42:36 PM,400 Block of CHURCH ST,,94114,B02,06,523.0,,3,3,False,,1,CHIEF,4,2,8,Castro/Upper Market,"(37.7632288973036, -122.4286109134)",141020070-B10,130.3,Administrative
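
Why can a filter on the partition column be cheap? A rough plain-Python sketch of the idea follows. Spark's real mechanism lives in its query planner; `prune_partitions` is a hypothetical helper, and the folder names are taken from the listing shown earlier.

```python
def prune_partitions(folders, column, wanted):
    """Keep only the folders whose name encodes column == wanted."""
    target = f"{column}={wanted}"
    return [f for f in folders if f.rstrip("/") == target]


folders = [
    "CallType=Administrative/",
    "CallType=Aircraft Emergency/",
    "CallType=Alarms/",
    "CallType=Assist Police/",
]

to_scan = prune_partitions(folders, "CallType", "Administrative")
skipped = len(folders) - len(to_scan)

print(to_scan)
print(skipped)
```

The decision is made from folder names alone, so the data files in the skipped folders never have to be opened.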


We can now look at further `DataFrame` operations. What if we want to count the distinct values of `CallType`, or list those distinct values?

In [0]:
from pyspark.sql.functions import col, countDistinct
(
    df.select("CallType")
      .where(col("CallType").isNotNull())
      .agg(countDistinct("CallType").alias("DistinctCallTypes"))
      .show()
)

+-----------------+
|DistinctCallTypes|
+-----------------+
|               32|
+-----------------+



In [0]:
(
    df.select("CallType")
    .where(col("CallType").isNotNull())
    .distinct()
    .show(10, False)
)

+-----------------------------+
|CallType                     |
+-----------------------------+
|Alarms                       |
|Odor (Strange / Unknown)     |
|Citizen Assist / Service Call|
|Vehicle Fire                 |
|Other                        |
|Outside Fire                 |
|Electrical Hazard            |
|Structure Fire               |
|Medical Incident             |
|Fuel Spill                   |
+-----------------------------+
only showing top 10 rows
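
One detail about nulls is worth spelling out. In Spark, `countDistinct` does not count nulls, while `distinct()` keeps a null as one of the distinct rows, so the `isNotNull()` filter matters mainly for the second query. The plain-Python sketch below mirrors that difference with `None`; the sample values are made up for illustration.

```python
# Made-up sample column with some missing values.
call_types = ["Alarms", None, "Structure Fire", "Alarms", None, "Medical Incident"]

# countDistinct-style: nulls are not counted.
distinct_count = len({v for v in call_types if v is not None})

# distinct()-style: a null is kept as one of the distinct values.
distinct_values = set(call_types)

print(distinct_count)
print(len(distinct_values))
```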



Often, you may want to **rename** columns when working with DataFrames. For example, let's first look at the current column names:

In [0]:
display(df)

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103,B02,36,2338,1,1,2,True,,1,MEDIC,1,2.0,6,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333
20110015,M08,2003233,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:10:17 AM,300 Block of 5TH ST,SF,94107,B03,8,2243,1,1,2,True,,1,MEDIC,1,3.0,6,South of Market,"(37.7792841462441, -122.402061300134)",020110015-M08,3.0833333
20110016,B02,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,6,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B02,3.05
20110016,B04,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:54 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,3,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B04,2.3166666
20110016,D2,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,4,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-D2,3.0166667
20110016,E03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,7,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E03,2.6833334
20110016,E38,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:17 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,1,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E38,2.1
20110016,E41,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,8,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E41,2.7166667
20110016,M03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:38 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,True,,1,MEDIC,10,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-M03,2.7666667
20110016,RS1,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:57 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,RESCUE SQUAD,9,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-RS1,3.2666667


In [0]:
display(
    df.withColumnRenamed("AvailableDtTm", "AvailableDatetime")
)

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDatetime,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103,B02,36,2338,1,1,2,True,,1,MEDIC,1,2.0,6,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333
20110015,M08,2003233,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:10:17 AM,300 Block of 5TH ST,SF,94107,B03,8,2243,1,1,2,True,,1,MEDIC,1,3.0,6,South of Market,"(37.7792841462441, -122.402061300134)",020110015-M08,3.0833333
20110016,B02,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,6,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B02,3.05
20110016,B04,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:54 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,3,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B04,2.3166666
20110016,D2,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,4,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-D2,3.0166667
20110016,E03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,7,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E03,2.6833334
20110016,E38,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:17 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,1,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E38,2.1
20110016,E41,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,8,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E41,2.7166667
20110016,M03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:38 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,True,,1,MEDIC,10,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-M03,2.7666667
20110016,RS1,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:57 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,RESCUE SQUAD,9,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-RS1,3.2666667


What if we want to change the type of a `Column`?

In [0]:
df.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
 

For example, the column `CallDate` is actually a `string`. We may want to convert it to a date so that we can process it more easily. The same holds for `AvailableDtTm`, which we can convert to a timestamp.

In [0]:
from pyspark.sql.functions import to_timestamp, to_date
display(
    df.withColumn("IncidentDate", to_date(col("CallDate"), "MM/dd/yyyy")).drop("CallDate")
    .withColumn("AvailableDatetime", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a"))
    .drop("AvailableDtTm")
    .select("IncidentDate", "AvailableDatetime")
)

IncidentDate,AvailableDatetime
2002-01-11,2002-01-11T01:58:43.000+0000
2002-01-11,2002-01-11T02:10:17.000+0000
2002-01-11,2002-01-11T01:47:00.000+0000
2002-01-11,2002-01-11T01:51:54.000+0000
2002-01-11,2002-01-11T01:47:00.000+0000
2002-01-11,2002-01-11T01:47:00.000+0000
2002-01-11,2002-01-11T01:51:17.000+0000
2002-01-11,2002-01-11T01:47:00.000+0000
2002-01-11,2002-01-11T01:46:38.000+0000
2002-01-11,2002-01-11T01:46:57.000+0000
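
The format strings passed to `to_date` and `to_timestamp` describe how the text is laid out. As a rough aid for reading them, the plain-Python sketch below parses the same kind of values with `datetime.strptime`. The Python directives are approximate counterparts chosen for illustration, not what Spark uses internally, and the sample strings are copied from the outputs shown earlier.

```python
from datetime import datetime

call_date = "01/11/2002"
available = "01/11/2002 01:58:43 AM"
afternoon = "02/21/2014 04:31:23 PM"

# "MM/dd/yyyy" -> month / day / four-digit year
incident_date = datetime.strptime(call_date, "%m/%d/%Y").date()

# "MM/dd/yyyy hh:mm:ss a" -> 12-hour clock (hh) followed by an AM/PM marker (a)
available_dt = datetime.strptime(available, "%m/%d/%Y %I:%M:%S %p")
afternoon_dt = datetime.strptime(afternoon, "%m/%d/%Y %I:%M:%S %p")

print(incident_date)      # 2002-01-11
print(available_dt)       # 2002-01-11 01:58:43
print(afternoon_dt.hour)  # 16
```

The last line shows why the AM/PM marker matters: without it, `04:31:23 PM` could not be told apart from the early morning.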


Now that a column like `IncidentDate` is a `date`, we can apply date-related methods to process it.

In [0]:
from pyspark.sql.functions import year

display(
    df.withColumn("IncidentDate", to_date(col("CallDate"), "MM/dd/yyyy"))
        .withColumn("IncidentYear", year(col("IncidentDate")))
        .groupBy("IncidentYear").count()
        .sort(col("IncidentYear").asc())
)


IncidentYear,count
2000,139200
2001,194309
2002,201575
2003,214503
2004,211056
2005,204569
2006,204616
2007,208250
2008,221652
2009,217800


We have just seen the `groupBy` method in action combined with a `count`. What if we want to apply multiple aggregation functions in a single grouping?

Consider the following example: for each `Priority` level, compute the minimum, the average, and the maximum `Delay`.

In [0]:
import pyspark.sql.functions as F  # aliased to avoid shadowing Python's built-in min and max
display(
    df.groupBy("Priority")
    .agg(
        F.min("Delay"),
        F.avg("Delay"),
        F.max("Delay")
    )
    .sort(col("Priority").asc())
)

Priority,min(Delay),avg(Delay),max(Delay)
,10.8,12.224999904632568,13.65
1,0.016666668,5.236210810876286,428.8
2,0.016666668,4.586479322375996,887.95
3,0.016666668,3.7075419009990425,1879.6167
A,0.18333334,3.9272709982136775,87.166664
B,0.15,5.210766671180725,105.316666
C,0.28333333,4.481972793129836,25.583334
E,0.083333336,3.4372339203867206,883.63336
I,0.93333334,5.124999975030486,20.416666
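
For readers who want to see what such a grouped aggregation computes, here is a minimal plain-Python sketch of the same logic. It is only an illustration of the semantics, not of how Spark executes the query, and the sample rows are made up.

```python
from collections import defaultdict

# Made-up sample rows for illustration.
rows = [
    {"Priority": "1", "Delay": 5.0},
    {"Priority": "1", "Delay": 3.0},
    {"Priority": "2", "Delay": 4.5},
    {"Priority": "2", "Delay": 1.5},
    {"Priority": "2", "Delay": 3.0},
]

# Step 1: collect the delays of each group (the groupBy).
groups = defaultdict(list)
for row in rows:
    groups[row["Priority"]].append(row["Delay"])

# Step 2: apply several aggregations to each group (the agg), sorted by key.
summary = {
    priority: {
        "min": min(delays),
        "avg": sum(delays) / len(delays),
        "max": max(delays),
    }
    for priority, delays in sorted(groups.items())
}

for priority, stats in summary.items():
    print(priority, stats)
```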


## Exercises

1. Consider the calls where `CallType` is `Structure Fire`. Compute the average `Delay` for each month of the year. For example, is the delay higher in December or in June?
2. Which neighborhood in San Francisco generated the most fire calls in 2018?
3. Can you store the number of calls by year in a Parquet file?