## DataFrame Operations

Most Spark jobs start by reading data from somewhere. The [DataFrameReader](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.html) interface lets us load a `DataFrame` from a variety of data sources, such as CSV, Parquet, JDBC, and JSON.

We'll use an example from _Chapter 3_ of the _Learning Spark v2_ book.

In [0]:
from pyspark.sql.types import IntegerType, StringType, BooleanType, FloatType, StructType, StructField

fire_schema = StructType([
    StructField('CallNumber', IntegerType(), True),
    StructField('UnitID', StringType(), True),
    StructField('IncidentNumber', IntegerType(), True),
    StructField('CallType', StringType(), True),
    StructField('CallDate', StringType(), True),
    StructField('WatchDate', StringType(), True),
    StructField('CallFinalDisposition', StringType(), True),
    StructField('AvailableDtTm', StringType(), True),
    StructField('Address', StringType(), True),
    StructField('City', StringType(), True),
    StructField('Zipcode', IntegerType(), True),
    StructField('Battalion', StringType(), True),
    StructField('StationArea', StringType(), True),
    StructField('Box', StringType(), True),
    StructField('OriginalPriority', StringType(), True),
    StructField('Priority', StringType(), True),
    StructField('FinalPriority', IntegerType(), True),
    StructField('ALSUnit', BooleanType(), True),
    StructField('CallTypeGroup', StringType(), True),
    StructField('NumAlarms', IntegerType(), True),
    StructField('UnitType', StringType(), True),
    StructField('UnitSequenceInCallDispatch', IntegerType(), True),
    StructField('FirePreventionDistrict', StringType(), True),
    StructField('SupervisorDistrict', StringType(), True),
    StructField('Neighborhood', StringType(), True),
    StructField('Location', StringType(), True),
    StructField('RowID', StringType(), True),
    StructField('Delay', FloatType(), True)
])

In [0]:
sf_fire_file = "/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv"
df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)

In [0]:
display(df)

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103,B02,36,2338,1,1,2,True,,1,MEDIC,1,2.0,6,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333
20110015,M08,2003233,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:10:17 AM,300 Block of 5TH ST,SF,94107,B03,8,2243,1,1,2,True,,1,MEDIC,1,3.0,6,South of Market,"(37.7792841462441, -122.402061300134)",020110015-M08,3.0833333
20110016,B02,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,6,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B02,3.05
20110016,B04,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:54 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,3,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-B04,2.3166666
20110016,D2,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,CHIEF,4,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-D2,3.0166667
20110016,E03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,7,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E03,2.6833334
20110016,E38,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:17 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,1,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E38,2.1
20110016,E41,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:47:00 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,ENGINE,8,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-E41,2.7166667
20110016,M03,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:38 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,True,,1,MEDIC,10,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-M03,2.7666667
20110016,RS1,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:46:57 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,False,,1,RESCUE SQUAD,9,4.0,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-RS1,3.2666667


## Writing data
We can use the [DataFrameWriter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.html) interface to persist data in a variety of formats.

Now, let's filter the `DataFrame` and persist it. You can specify the format that Spark uses to persist via `.format`.

In [0]:
%fs mkdirs output

In [0]:
df.where(df.UnitID == 'M29').write.format("parquet").save('/output/fire_calls_m29.parquet')

In [0]:
%fs ls /output/

path,name,size,modificationTime
dbfs:/output/fire_calls_m29.parquet/,fire_calls_m29.parquet/,0,0


Hmm, something looks strange. Why the trailing `/`? Because `fire_calls_m29.parquet` is actually a folder. Let's look inside it.

In [0]:
%fs ls /output/fire_calls_m29.parquet

path,name,size,modificationTime
dbfs:/output/fire_calls_m29.parquet/_SUCCESS,_SUCCESS,0,1659937732000
dbfs:/output/fire_calls_m29.parquet/_committed_5233398857014364678,_committed_5233398857014364678,519,1659937731000
dbfs:/output/fire_calls_m29.parquet/_started_5233398857014364678,_started_5233398857014364678,0,1659937692000
dbfs:/output/fire_calls_m29.parquet/part-00000-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-44-1-c000.snappy.parquet,part-00000-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-44-1-c000.snappy.parquet,423837,1659937730000
dbfs:/output/fire_calls_m29.parquet/part-00001-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-45-1-c000.snappy.parquet,part-00001-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-45-1-c000.snappy.parquet,119298,1659937728000
dbfs:/output/fire_calls_m29.parquet/part-00005-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-49-1-c000.snappy.parquet,part-00005-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-49-1-c000.snappy.parquet,149141,1659937727000
dbfs:/output/fire_calls_m29.parquet/part-00006-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-50-1-c000.snappy.parquet,part-00006-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-50-1-c000.snappy.parquet,218665,1659937728000
dbfs:/output/fire_calls_m29.parquet/part-00007-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-51-1-c000.snappy.parquet,part-00007-tid-5233398857014364678-9d9f456c-7ac0-445c-8ce5-5c7cf9aefabe-51-1-c000.snappy.parquet,169449,1659937727000


Spark has actually split the dataset under the hood into multiple Parquet files. This happens automatically, and most of the time we don't need to care about it. We can control how Spark partitions the output with the `partitionBy` method, but we'll look at that later.

The partitioning stays hidden from us even when we read the folder back.

In [0]:
display(
    spark
        .read
        .parquet('/output/fire_calls_m29.parquet')  # this is actually a folder; `header` is a CSV option and is not needed for Parquet
)

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110014,M29,2003234,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 01:58:43 AM,10TH ST/MARKET ST,SF,94103.0,B02,36,2338,1,1,2,True,,1,MEDIC,1,2.0,6.0,Tenderloin,"(37.7765408927183, -122.417501464907)",020110014-M29,5.233333
20110024,M29,2003243,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:48:59 AM,400 Block of KANSAS ST,SF,94107.0,B02,29,2422,3,3,3,True,,1,MEDIC,1,2.0,10.0,Potrero Hill,"(37.7640801201771, -122.403538779847)",020110024-M29,3.35
20110372,M29,2003452,Medical Incident,01/11/2002,01/11/2002,Other,01/11/2002 07:22:34 PM,23RD ST/TENNESSEE ST,SF,94107.0,B10,25,2574,3,3,3,True,,1,MEDIC,3,10.0,10.0,Potrero Hill,"(37.7553296242037, -122.389001459907)",020110372-M29,2.3166666
20110400,M29,2003481,Medical Incident,01/11/2002,01/11/2002,Other,01/11/2002 08:42:15 PM,MISSOURI ST/TURNER TR,SF,94107.0,B10,37,2542,1,1,2,True,,1,MEDIC,1,10.0,10.0,Potrero Hill,"(37.7570408501669, -122.395683799309)",020110400-M29,12.7
20110424,M29,2003498,Medical Incident,01/11/2002,01/11/2002,Other,01/11/2002 09:21:40 PM,0 Block of 7TH ST,SF,94103.0,B02,1,2316,3,3,3,True,,1,MEDIC,1,2.0,6.0,South of Market,"(37.7802039058686, -122.412272455406)",020110424-M29,5.0666666
20110472,M29,2003533,Medical Incident,01/11/2002,01/11/2002,Other,01/12/2002 12:04:08 AM,2000 Block of 17TH ST,SF,94107.0,B02,29,2422,3,3,3,True,,1,MEDIC,1,2.0,10.0,Potrero Hill,"(37.7646388261392, -122.404006760452)",020110472-M29,2.65
20120015,M29,2003549,Medical Incident,01/12/2002,01/11/2002,Other,01/12/2002 02:08:26 AM,200 Block of 8TH ST,SF,94103.0,B02,29,2322,2,2,2,True,,1,MEDIC,1,2.0,6.0,South of Market,"(37.7752488388293, -122.410287728099)",020120015-M29,3.7333333
20120036,M29,2003568,Medical Incident,01/12/2002,01/11/2002,Other,01/12/2002 03:51:57 AM,1100 Block of HAMPSHIRE ST,SF,94110.0,B10,37,2554,3,3,3,True,,1,MEDIC,1,6.0,9.0,Mission,"(37.7537021987795, -122.407438841214)",020120036-M29,3.2
20120058,M29,2003590,Medical Incident,01/12/2002,01/11/2002,Other,01/12/2002 07:59:15 AM,200 Block of 8TH ST,SF,94103.0,B02,29,2322,3,3,3,True,,1,MEDIC,1,2.0,6.0,South of Market,"(37.7760463471271, -122.41128875661)",020120058-M29,3.7
20120076,M29,2003601,Medical Incident,01/12/2002,01/12/2002,Other,01/12/2002 09:21:33 AM,CHURCH ST/MARKET ST,SF,94114.0,B02,6,5213,2,2,2,True,,1,MEDIC,1,2.0,8.0,Castro/Upper Market,"(37.7675044595983, -122.428948567606)",020120076-M29,5.65
