In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName('PySparkLearning').getOrCreate()

In [4]:
filePath="../Resources/small_zipcode.csv"
df = spark.read.options(header='true', inferSchema='true') \
          .csv(filePath)

df.printSchema()
df.show(truncate=False)

root
 |-- id: integer (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |null      |
+---+-------+--------+-------------------+-----+----------+



### PySpark Drop Rows with NULL Values

A PySpark DataFrame exposes an `na` property, which is an instance of the `DataFrameNaFunctions` class. Call `drop()` through this property, as in `df.na.drop()`, to remove rows that contain NULL values.

#### Drop Rows with NULL Values in Any Column
By default, `drop()` without arguments removes every row that has a null value in any column of the DataFrame.


In [6]:
df.na.drop().show()

+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+




Only the row with id=4 has no NULL values, so it is the only row left in the returned DataFrame.

Alternatively, `na.drop("any")` gives the same result, because `"any"` is the default value of the `how` parameter.


In [8]:
df.na.drop("any").show()

+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+



#### Drop Rows with NULL Values on All Columns
The example below drops only the rows in which every column is NULL. Our DataFrame has no such row, so the example returns all rows.


In [11]:
df.na.drop("all").show()

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+



#### Drop Rows with NULL Values on Selected Columns

To remove rows that have NULL values in selected columns of a PySpark DataFrame, pass the names of the columns you want to check to the `subset` parameter of `drop()`. The `drop(columns:Seq[String])` and `drop(columns:Array[String])` signatures belong to the Scala API and are not used in PySpark.



In [13]:
df.na.drop(subset=["population","type"]) \
   .show(truncate=False)

# Removes rows that have a NULL value in the population or type column.

+---+-------+--------+-----------------+-----+----------+
|id |zipcode|type    |city             |state|population|
+---+-------+--------+-----------------+-----+----------+
|1  |704    |STANDARD|null             |PR   |30100     |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS|TX   |84000     |
+---+-------+--------+-----------------+-----+----------+



#### Using dropna() of DataFrame
The DataFrame's own `dropna()` function also drops rows with NULL values; `df.dropna()` and `df.na.drop()` are aliases of each other. Below is a PySpark example.


In [14]:
df.dropna().show(truncate=False)

+---+-------+------+-----------------+-----+----------+
|id |zipcode|type  |city             |state|population|
+---+-------+------+-----------------+-----+----------+
|4  |76166  |UNIQUE|CINGULAR WIRELESS|TX   |84000     |
+---+-------+------+-----------------+-----+----------+

