In [3]:
import pyspark
from pyspark.sql import SparkSession 

#### Inspired by pandas DataFrames in structure, format, and a few specific operations, Spark DataFrames are like distributed in-memory tables with named columns and schemas, where each column has a specific data type: integer, string, array, map, real, date, timestamp, etc.

In [4]:
spark = SparkSession.builder.getOrCreate()

### Dealing with missing data in PySpark
#### Three options for missing data
1. Keep them.
2. Remove them.
3. Fill them with some values.

In [5]:
df = spark.read.csv('NullData.csv', header=True, inferSchema=True)

In [6]:
df.show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [7]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sales: double (nullable = true)



In [8]:
## How to deal with missing values in Spark?
## With no arguments, na.drop() removes every row that contains at least one null
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



#### If `thresh` is specified, rows that have fewer than `thresh` non-null values are dropped.
#### This overrides the `how` parameter.

In [9]:

df.na.drop(thresh=1).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [13]:
## thresh=2 keeps only the rows that have at least two non-null values
df.na.drop(thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [10]:
## thresh=3 keeps only the rows that have at least three non-null values

df.na.drop(thresh=3).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



In [16]:
df.na.drop(subset=['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [18]:
df.na.drop(subset=['Name','Sales'], thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+
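The `how`, `thresh`, and `subset` rules used in the cells above can be sketched in plain Python. This is only a model of the rule for illustration, not Spark's implementation: `kept_ids` is a hypothetical helper, and `None` stands in for `NULL` in the four rows of `NullData.csv` shown earlier.

```python
# Rows from NullData.csv as displayed by df.show(); None stands in for NULL.
rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]


def kept_ids(rows, how="any", thresh=None, subset=None):
    """Hypothetical helper: model which rows df.na.drop(...) keeps."""
    kept = []
    for row in rows:
        # Only the columns in `subset` are inspected; default is every column.
        cols = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            keep = non_null >= thresh      # thresh takes precedence over how
        elif how == "any":
            keep = non_null == len(cols)   # drop if any inspected column is null
        else:                              # how == "all"
            keep = non_null > 0            # drop only if all inspected columns are null
        if keep:
            kept.append(row["Id"])
    return kept


print(kept_ids(rows))                                      # ['emp4']
print(kept_ids(rows, thresh=2))                            # ['emp1', 'emp3', 'emp4']
print(kept_ids(rows, subset=["Sales"]))                    # ['emp3', 'emp4']
print(kept_ids(rows, subset=["Name", "Sales"], thresh=2))  # ['emp4']
```

The printed results match the Spark outputs above. With `thresh=1` every row survives because `Id` is never null.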



In [21]:
## how to fill the null values in Spark?

df.na.fill('No Name').show()
## A string fill value is applied only to nulls in string columns (here Name); Sales is left unchanged

+----+-------+-----+
|  Id|   Name|Sales|
+----+-------+-----+
|emp1|   John| NULL|
|emp2|No Name| NULL|
|emp3|No Name|345.0|
|emp4|  Cindy|456.0|
+----+-------+-----+



In [20]:
df.na.fill(25).show()
## A numeric fill value is applied only to nulls in numeric columns (here Sales, a double); Name is left unchanged


+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| 25.0|
|emp2| NULL| 25.0|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [22]:
df.na.fill(25, subset=['Sales']).show()


+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| 25.0|
|emp2| NULL| 25.0|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
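The type matching seen in the fill examples can be modelled in plain Python. This is a simplified sketch that assumes only string and double columns, as in this DataFrame. `fill_model` is a hypothetical helper, not part of PySpark.

```python
# Column types as printed by df.printSchema() above.
schema = {"Id": "string", "Name": "string", "Sales": "double"}

# Rows as displayed by df.show(); None stands in for NULL.
rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]


def fill_model(rows, value, subset=None):
    """Hypothetical helper: model df.na.fill(value, subset) for str and numeric values."""
    wanted = "string" if isinstance(value, str) else "double"
    # Columns whose type does not match the fill value are ignored.
    cols = [c for c in (subset or schema) if schema[c] == wanted]
    filled = []
    for row in rows:
        new_row = dict(row)
        for c in cols:
            if new_row[c] is None:
                # The fill value takes the column's type, so 25 shows up as 25.0.
                new_row[c] = value if wanted == "string" else float(value)
        filled.append(new_row)
    return filled


print([r["Name"] for r in fill_model(rows, "No Name")])   # ['John', 'No Name', 'No Name', 'Cindy']
print([r["Sales"] for r in fill_model(rows, 25)])          # [25.0, 25.0, 345.0, 456.0]
```

Because only matching columns are touched, `fill(25)` and `fill(25, subset=['Sales'])` give the same result on this DataFrame.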



### DataFrame Operations

In Python, it’s possible to access a DataFrame’s columns either by attribute <b>(df.age)</b> or by indexing <b>(df['age'])</b>. While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.

In [27]:
from pyspark.sql.functions import mean 
mean_value = df.select(mean(df['Sales']).alias('SalesMean')).collect()[0].SalesMean


##mean_value = df.select(mean(df['Sales']).alias('SalesMean')).collect()[0][0]


In [28]:
mean_value

400.5

In [30]:
df.na.fill(mean_value, subset= ['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| NULL|400.5|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
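The value 400.5 is the mean of the two non-null `Sales` entries, because the `mean` aggregate skips nulls. A small plain-Python check of that arithmetic, with `None` standing in for `NULL`:

```python
from statistics import mean

# Sales column as displayed by df.show(); None stands in for NULL.
sales = [None, None, 345.0, 456.0]

# Nulls are left out of the average, so only 345.0 and 456.0 contribute.
non_null = [s for s in sales if s is not None]
mean_value = mean(non_null)

# Replace each null with the mean, as na.fill(mean_value, subset=['Sales']) does.
filled = [mean_value if s is None else s for s in sales]

print(mean_value)  # 400.5
print(filled)      # [400.5, 400.5, 345.0, 456.0]
```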



In [34]:
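## In machine learning workflows you sometimes want to convert a subset of a Spark DataFrame
## to a pandas DataFrame for feature engineering. toPandas() collects all rows to the driver,
## so use it only on data that fits in the driver's memory.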
df_toPandas = df.toPandas()

In [35]:
df_toPandas.head()

|   | Id   | Name  | Sales |
|---|------|-------|-------|
| 0 | emp1 | John  |       |
| 1 | emp2 |       |       |
| 2 | emp3 |       | 345.0 |
| 3 | emp4 | Cindy | 456.0 |


## Schemas and Creating DataFrames

A schema in Spark defines the column names and associated data types for a DataFrame. Most often, schemas come into play when you are reading structured data
from an external data source. Defining a schema
up front, as opposed to taking a schema-on-read approach, offers three benefits:
<b>
1. You relieve Spark from the onus of inferring data types.
2. You prevent Spark from creating a separate job just to read a large portion of your file to ascertain the schema, which for a large data file can be expensive and time-consuming.
3. You can detect errors early if data doesn’t match the schema.
</b>

<i>So, it is encouraged to always define your schema up front whenever you want to
read a large file from a data source.</i>

In [37]:
df_fire = spark.read.csv('sf-fire-calls.csv', header=True,inferSchema=True)

In [38]:
df_fire.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
 

#### With big data, asking Spark to infer the schema means it has to read the whole dataset,
#### which can be too much load and processing for a large file.
#### To reduce that cost, pass `samplingRatio=0.001` so that Spark infers the schema from a small sample of the rows.

In [39]:

df_fire_sample = spark.read.csv('sf-fire-calls.csv', header=True,inferSchema=True, samplingRatio=0.001)

In [40]:
df_fire_sample.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: integer (nullable = true)
 |-- Box: integer (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)

In [41]:
df_fire_sample.schema

StructType([StructField('CallNumber', IntegerType(), True), StructField('UnitID', StringType(), True), StructField('IncidentNumber', IntegerType(), True), StructField('CallType', StringType(), True), StructField('CallDate', StringType(), True), StructField('WatchDate', StringType(), True), StructField('CallFinalDisposition', StringType(), True), StructField('AvailableDtTm', StringType(), True), StructField('Address', StringType(), True), StructField('City', StringType(), True), StructField('Zipcode', IntegerType(), True), StructField('Battalion', StringType(), True), StructField('StationArea', IntegerType(), True), StructField('Box', IntegerType(), True), StructField('OriginalPriority', StringType(), True), StructField('Priority', StringType(), True), StructField('FinalPriority', IntegerType(), True), StructField('ALSUnit', BooleanType(), True), StructField('CallTypeGroup', StringType(), True), StructField('NumAlarms', IntegerType(), True), StructField('UnitType', StringType(), True), 

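#### Now that you have the schema from the sample, you can pass it when reading the original dataset. Note that the sample inferred `StationArea` and `Box` as integer while the full inference above inferred string, so check such columns before relying on a sampled schema.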

In [42]:
df_fire = spark.read.csv('sf-fire-calls.csv', header=True,schema=df_fire_sample.schema)

In [43]:
df_fire.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: integer (nullable = true)
 |-- Box: integer (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
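As an alternative to reusing an inferred schema, the schema can be written out by hand and passed as a DDL-formatted string, which `spark.read.csv` accepts for its `schema` argument. The sketch below is one possible way to do that, not the only one. The column names and types come from the printed schemas in this notebook, and the last five columns from the DataFrame output further down. `StationArea` and `Box` are kept as strings here, following the full inference. `FIRE_COLUMNS`, `FIRE_SCHEMA_DDL`, and `read_fire_calls` are names introduced only for this example.

```python
# (column name, Spark SQL type) pairs in file order.
FIRE_COLUMNS = [
    ("CallNumber", "INT"), ("UnitID", "STRING"), ("IncidentNumber", "INT"),
    ("CallType", "STRING"), ("CallDate", "STRING"), ("WatchDate", "STRING"),
    ("CallFinalDisposition", "STRING"), ("AvailableDtTm", "STRING"),
    ("Address", "STRING"), ("City", "STRING"), ("Zipcode", "INT"),
    ("Battalion", "STRING"), ("StationArea", "STRING"), ("Box", "STRING"),
    ("OriginalPriority", "STRING"), ("Priority", "STRING"),
    ("FinalPriority", "INT"), ("ALSUnit", "BOOLEAN"),
    ("CallTypeGroup", "STRING"), ("NumAlarms", "INT"), ("UnitType", "STRING"),
    ("UnitSequenceInCallDispatch", "INT"), ("FirePreventionDistrict", "STRING"),
    ("SupervisorDistrict", "INT"), ("Neighborhood", "STRING"),
    ("Location", "STRING"), ("RowID", "STRING"), ("Delay", "DOUBLE"),
]

# Backticks quote each name so that none is mistaken for a SQL keyword.
FIRE_SCHEMA_DDL = ", ".join(f"`{name}` {dtype}" for name, dtype in FIRE_COLUMNS)


def read_fire_calls(spark, path="sf-fire-calls.csv"):
    """Read the CSV with an explicit schema, so Spark does not need an inference pass."""
    return spark.read.csv(path, header=True, schema=FIRE_SCHEMA_DDL)


print(FIRE_SCHEMA_DDL)
```

With an existing session, `df_fire = read_fire_calls(spark)` would then replace the `inferSchema=True` read.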

In [45]:
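## Import only the names that are used, since a wildcard import shadows Python built-ins such as sum, min and max
from pyspark.sql.functions import col, to_timestamp
import pyspark.sql.functions as fn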

In [50]:
df_Medical = df_fire.select('IncidentNumber', 'AvailableDtTm', 'CallType')\
        .where(col('CallType')=='Medical Incident') 

In [51]:
df_Medical.show(5, truncate=False)

+--------------+----------------------+----------------+
|IncidentNumber|AvailableDtTm         |CallType        |
+--------------+----------------------+----------------+
|2003241       |01/11/2002 03:01:18 AM|Medical Incident|
|2003242       |01/11/2002 02:39:50 AM|Medical Incident|
|2003343       |01/11/2002 12:06:57 PM|Medical Incident|
|2003348       |01/11/2002 01:08:40 PM|Medical Incident|
|2003381       |01/11/2002 03:31:02 PM|Medical Incident|
+--------------+----------------------+----------------+
only showing top 5 rows



In [53]:
df_CallTypeNotNull = df_fire.select('IncidentNumber', 'AvailableDtTm', 'CallType')\
        .where(col('CallType').isNotNull()) 

In [54]:
df_CallTypeNotNull.show(30)

+--------------+--------------------+--------------------+
|IncidentNumber|       AvailableDtTm|            CallType|
+--------------+--------------------+--------------------+
|       2003235|01/11/2002 01:51:...|      Structure Fire|
|       2003241|01/11/2002 03:01:...|    Medical Incident|
|       2003242|01/11/2002 02:39:...|    Medical Incident|
|       2003250|01/11/2002 04:16:...|        Vehicle Fire|
|       2003259|01/11/2002 06:01:...|              Alarms|
|       2003279|01/11/2002 08:03:...|      Structure Fire|
|       2003301|01/11/2002 09:46:...|              Alarms|
|       2003304|01/11/2002 09:58:...|              Alarms|
|       2003343|01/11/2002 12:06:...|    Medical Incident|
|       2003348|01/11/2002 01:08:...|    Medical Incident|
|       2003381|01/11/2002 03:31:...|    Medical Incident|
|       2003382|01/11/2002 02:59:...|      Structure Fire|
|       2003399|01/11/2002 04:22:...|    Medical Incident|
|       2003403|01/11/2002 04:18:...|    Medical Inciden

In [63]:
df_CallDistinct = df_fire.select('CallType')\
        .where(col('CallType').isNotNull()).distinct()

In [65]:
df_CallDistinct.show()

+--------------------+
|            CallType|
+--------------------+
|Elevator / Escala...|
|         Marine Fire|
|  Aircraft Emergency|
|      Administrative|
|              Alarms|
|Odor (Strange / U...|
|Citizen Assist / ...|
|              HazMat|
|Watercraft in Dis...|
|           Explosion|
|           Oil Spill|
|        Vehicle Fire|
|  Suspicious Package|
|Extrication / Ent...|
|               Other|
|        Outside Fire|
|   Traffic Collision|
|       Assist Police|
|Gas Leak (Natural...|
|        Water Rescue|
+--------------------+
only showing top 20 rows



In [71]:
df_fire.select('CallType','City','UnitID').where(col('CallType').isNotNull()) \
        .distinct() \
        .sort('CallType', ascending=False) \
        .show(50,truncate=False)

+----------------------+-------------+------+
|CallType              |City         |UnitID|
+----------------------+-------------+------+
|Watercraft in Distress|SF           |E35   |
|Watercraft in Distress|SF           |RC1   |
|Watercraft in Distress|SF           |E16   |
|Watercraft in Distress|PR           |E34   |
|Watercraft in Distress|SF           |T08   |
|Watercraft in Distress|SF           |E02   |
|Watercraft in Distress|SF           |E13   |
|Watercraft in Distress|SF           |E28   |
|Watercraft in Distress|San Francisco|E35   |
|Watercraft in Distress|SF           |FB1   |
|Watercraft in Distress|SF           |RB1   |
|Watercraft in Distress|TI           |B03   |
|Watercraft in Distress|SAN FRANCISCO|B08   |
|Watercraft in Distress|FM           |94    |
|Watercraft in Distress|San Francisco|B10   |
|Watercraft in Distress|San Francisco|RB1   |
|Watercraft in Distress|San Francisco|RA48  |
|Watercraft in Distress|San Francisco|RS2   |
|Watercraft in Distress|SF        

#### Important Operations for Spark DF

In [74]:
df_fire2 = df_fire.withColumn('Delay in Seconds', col('Delay') * 60)

In [76]:
df_fire2.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: integer (nullable = true)
 |-- Box: integer (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)

In [82]:
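## withColumnRenamed returns a new DataFrame; df_fire2 itself is unchanged unless the result is assigned
df_fire2.withColumnRenamed('Delay', 'DelayInMins')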

DataFrame[CallNumber: int, UnitID: string, IncidentNumber: int, CallType: string, CallDate: string, WatchDate: string, CallFinalDisposition: string, AvailableDtTm: string, Address: string, City: string, Zipcode: int, Battalion: string, StationArea: int, Box: int, OriginalPriority: string, Priority: string, FinalPriority: int, ALSUnit: boolean, CallTypeGroup: string, NumAlarms: int, UnitType: string, UnitSequenceInCallDispatch: int, FirePreventionDistrict: string, SupervisorDistrict: int, Neighborhood: string, Location: string, RowID: string, DelayInMins: double, Delay in Seconds: double]

In [85]:
df_fire_dt = df_fire.withColumn('IncidentDate',to_timestamp(col('CallDate'), 'MM/dd/yyyy')).drop('CallDate')

In [86]:
df_fire_dt.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: integer (nullable = true)
 |-- Box: integer (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
 |-- SupervisorDistrict: integer (nulla
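To see what the `'MM/dd/yyyy'` pattern does to a single value, here is a plain-Python analogue using `datetime.strptime`. The mapping between Spark's pattern letters and Python's `%` codes is an approximation for illustration only, and Spark's parser may behave differently on edge cases. The date string is an illustrative value in the same format as `CallDate`. The timestamp string is taken from the `AvailableDtTm` output above, and the Spark pattern `'MM/dd/yyyy hh:mm:ss a'` mentioned for it is an assumption, not something used in this notebook.

```python
from datetime import datetime

# Spark 'MM/dd/yyyy' corresponds roughly to Python '%m/%d/%Y'.
call_date = datetime.strptime("01/11/2002", "%m/%d/%Y")

# A 12-hour timestamp with an AM/PM marker (assumed Spark pattern: 'MM/dd/yyyy hh:mm:ss a').
available = datetime.strptime("01/11/2002 03:01:18 AM", "%m/%d/%Y %I:%M:%S %p")

print(call_date)       # 2002-01-11 00:00:00
print(call_date.year)  # 2002
print(available)       # 2002-01-11 03:01:18
```

A date-only pattern leaves the time part at midnight, which is why the `IncidentDate` values shown later all end in `00:00:00`.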

In [93]:
df_fire_dt.select('CallType','CallNumber', 'IncidentDate').where(fn.year('IncidentDate')==2003).show()

+----------------+----------+-------------------+
|        CallType|CallNumber|       IncidentDate|
+----------------+----------+-------------------+
|Medical Incident|  30010041|2003-01-01 00:00:00|
|Medical Incident|  30010045|2003-01-01 00:00:00|
|Medical Incident|  30010068|2003-01-01 00:00:00|
|          Alarms|  30010080|2003-01-01 00:00:00|
|  Structure Fire|  30010086|2003-01-01 00:00:00|
|Medical Incident|  30010134|2003-01-01 00:00:00|
|          Alarms|  30010135|2003-01-01 00:00:00|
|    Vehicle Fire|  30010140|2003-01-01 00:00:00|
|Medical Incident|  30010176|2003-01-01 00:00:00|
|Medical Incident|  30010226|2003-01-01 00:00:00|
|           Other|  30010240|2003-01-01 00:00:00|
|Medical Incident|  30010310|2003-01-01 00:00:00|
|  Structure Fire|  30010316|2003-01-01 00:00:00|
|Medical Incident|  30010348|2003-01-01 00:00:00|
|Medical Incident|  30010360|2003-01-01 00:00:00|
|Medical Incident|  30010361|2003-01-01 00:00:00|
|Medical Incident|  30010377|2003-01-01 00:00:00|


In [97]:
df_fire_dt.select('CallType','CallNumber',fn.year('IncidentDate').alias('IncidentYear')) \
            .where('IncidentYear=2000') \
            .show()

+--------------------+----------+------------+
|            CallType|CallNumber|IncidentYear|
+--------------------+----------+------------+
|    Medical Incident|   1040031|        2000|
|Citizen Assist / ...|   1040086|        2000|
|    Medical Incident|   1040236|        2000|
|        Outside Fire|   1040263|        2000|
|    Medical Incident|   1050006|        2000|
|    Medical Incident|   1050046|        2000|
|    Medical Incident|   1050051|        2000|
|    Medical Incident|   1050103|        2000|
|    Medical Incident|   1050154|        2000|
|    Medical Incident|   1050186|        2000|
|    Medical Incident|   1050312|        2000|
|    Medical Incident|   1050364|        2000|
|    Medical Incident|   1050374|        2000|
|    Medical Incident|   1060076|        2000|
|              Alarms|   1060094|        2000|
|    Medical Incident|   1060128|        2000|
|      Structure Fire|   1060140|        2000|
|               Other|   1060165|        2000|
|      Struct

#### Thank you!