## Overview

In this article we will look at how to handle missing values using PySpark. Handling missing values is one of the most critical parts of any data exploration and analysis pipeline, and on a large dataset data engineers need the skills to deal with NA/missing values properly.

This is the second article in the PySpark series. If you are not yet familiar with the basics of DataFrames in PySpark, I suggest going through my previous article first: Data Preprocessing using PySpark - PySpark's DataFrame.

## Table of contents
Handling NULL values and missing values using **`PySpark`**
* **Spark Session:** Starting the Spark session - Mandatory.
* **Reading the dataset:** Reading the **Dummy** dataset.
* **Dropping columns:** Dropping the columns which have null values and knowing when to drop complete columns.
* **Dropping rows:** Dropping particular rows based on the null values encountered.
* **Parameters in dropping functionalities:** Getting to know the various parameters of PySpark's dropping function.
* **Missing values by Mean, Median and Mode:** Handling missing values by imputing the Mean, Median or Mode depending on the requirements.

Before moving on to the main topics of this article, let's first do the mandatory step: **`Starting the PySpark Session`**

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=2b898ef4c839ce8cfa4b440d6bb32c40d4b459b5281213629d710658a8bdf9fa
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Starting the PySpark session

In [2]:
from pyspark.sql import SparkSession

null_spark = SparkSession.builder.appName('Handling Missing values using PySpark').getOrCreate()
null_spark

**Note:** I have already covered this step in detail in the first blog of the PySpark series, **Getting started with PySpark**, so please read that article before moving forward with this one.

## Reading the dataset

In [7]:
df_null_pyspark = null_spark.read.csv('/content/part2.csv', header = True, inferSchema = True)
df_null_pyspark

DataFrame[Employee Name: string, Age of Employee: int, Experience (in years): int, Salary (per month - $): int]

Breaking down the **read.csv()** function:
This function is solely responsible for reading CSV-formatted data in PySpark.
* 1st parameter: Complete path of the dataset.
* 2nd parameter: **header-** Uses the first line of the file as the column names **when the flag is True**.
* 3rd parameter: **inferSchema-** Makes Spark scan the data and infer the **data type** of each column **when the flag is True**; otherwise every column is read as a string.

## Displaying the dataset using show() function

In [8]:
df_null_pyspark.show()

+-------------+---------------+---------------------+----------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|
+-------------+---------------+---------------------+----------------------+
|       Oliver|             31|                   10|                 30000|
|        Harry|             30|                    8|                 25000|
|       George|             29|                    4|                 20000|
|         Jack|             24|                    3|                 20000|
|        Jacob|             21|                    1|                 15000|
|          Leo|             23|                    2|                 18000|
|        Oscar|           null|                 null|                 40000|
|         null|             34|                   10|                 38000|
|         null|             36|                 null|                  null|
+-------------+---------------+---------------------+----------------------+

As mentioned in the table of contents, we will be working with this **dummy dataset** to deal with the missing values in this article.

## Dropping the NULL values

Before we start dropping the null values, let me introduce a function that tells us which columns are allowed to hold null values.
That function is **`printSchema()`**, which prints the name, data type and nullability of each column, much like the **info()** method (or the `dtypes` attribute) in pandas.

In [9]:
df_null_pyspark.printSchema()

root
 |-- Employee Name: string (nullable = true)
 |-- Age of Employee: integer (nullable = true)
 |-- Experience (in years): integer (nullable = true)
 |-- Salary (per month - $): integer (nullable = true)



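**Inference:** Right after the name and type of each column we can see **nullable = true**, which means the column is *allowed* to contain null values, not that it necessarily does. Spark marks columns read from a CSV file as nullable by default, so to find out where the nulls actually are we still have to look at the data.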

To drop the **null (NA) values** from the dataset we simply use the **na.drop() function**; with no arguments it drops every row that has even one null value.

In [10]:
df_null_pyspark.na.drop().show()

+-------------+---------------+---------------------+----------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|
+-------------+---------------+---------------------+----------------------+
|       Oliver|             31|                   10|                 30000|
|        Harry|             30|                    8|                 25000|
|       George|             29|                    4|                 20000|
|         Jack|             24|                    3|                 20000|
|        Jacob|             21|                    1|                 15000|
|          Leo|             23|                    2|                 18000|
+-------------+---------------+---------------------+----------------------+



**Inference:** In the above output we can see that rows that contain the NULL values are dropped.

Previously we saw how to remove the rows with NULL values, but we also saw that a complete row is removed even if it has only one NULL value.

So can we control, to some extent, the condition under which a row gets removed?

The answer is YES! Let's discuss how we can do that.

## "HOW" parameter in na.drop() function

This parameter lets us decide under which condition a row with NULL values is removed or kept. It accepts two options, so let's keep a note of them:

* **HOW = "ANY":** This is the default. When we select **ANY**, a row is dropped if it contains **at least one null value**; only rows with no nulls at all are kept.

* **HOW = "ALL":** When we select ALL, a row is dropped only if **all of its values are null**; otherwise the row is left untouched.

### HOW = "ANY"

In [11]:
df_null_pyspark.na.drop(how="any").show()

+-------------+---------------+---------------------+----------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|
+-------------+---------------+---------------------+----------------------+
|       Oliver|             31|                   10|                 30000|
|        Harry|             30|                    8|                 25000|
|       George|             29|                    4|                 20000|
|         Jack|             24|                    3|                 20000|
|        Jacob|             21|                    1|                 15000|
|          Leo|             23|                    2|                 18000|
+-------------+---------------+---------------------+----------------------+



**Inference:** As discussed, with the "any" option the complete row is dropped when it holds **at least one NULL value**; otherwise the row remains unaffected. This is why the output is identical to the plain na.drop() call above.

### HOW = "ALL"

In [12]:
df_null_pyspark.na.drop(how="all").show()

+-------------+---------------+---------------------+----------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|
+-------------+---------------+---------------------+----------------------+
|       Oliver|             31|                   10|                 30000|
|        Harry|             30|                    8|                 25000|
|       George|             29|                    4|                 20000|
|         Jack|             24|                    3|                 20000|
|        Jacob|             21|                    1|                 15000|
|          Leo|             23|                    2|                 18000|
|        Oscar|           null|                 null|                 40000|
|         null|             34|                   10|                 38000|
|         null|             36|                 null|                  null|
+-------------+---------------+---------------------+----------------------+

**Inference:** As discussed, the "all" option drops a row only if every value in that record is NULL; otherwise nothing changes. Since no row in our dataset is completely empty, no row is dropped and the dataset is unchanged.
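The two outputs above follow a simple keep/drop rule. The sketch below models that rule in plain Python (no Spark needed), using three rows copied from the dummy dataset plus one fully empty row that is my own addition to show when "all" actually drops something. The helper name `keep_row` is hypothetical and not part of PySpark.

```python
# None stands in for a NULL value.
rows = [
    ("Oliver", 31, 10, 30000),
    ("Oscar", None, None, 40000),
    (None, 36, None, None),
    (None, None, None, None),  # hypothetical empty row, not in the dataset
]


def keep_row(row, how="any"):
    """Return True if the row survives a drop with the given `how` option."""
    nulls = sum(value is None for value in row)
    if how == "any":
        # Dropped as soon as one value is null.
        return nulls == 0
    if how == "all":
        # Dropped only when every value is null.
        return nulls < len(row)
    raise ValueError("how must be 'any' or 'all'")


kept_any = [row for row in rows if keep_row(row, "any")]
kept_all = [row for row in rows if keep_row(row, "all")]

print(kept_any)
print(kept_all)
```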

## "THRESH" parameter in na.drop() function

With this parameter we set the **minimum number of NON-NULL values** a row must have in order to be kept. For example, if we set the threshold to **2**, a row is dropped only if it has **fewer than 2 non-null values; otherwise that row will not get dropped.** When thresh is given, it overrides the how parameter.

In [13]:
df_null_pyspark.na.drop(thresh=2).show()

+-------------+---------------+---------------------+----------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|
+-------------+---------------+---------------------+----------------------+
|       Oliver|             31|                   10|                 30000|
|        Harry|             30|                    8|                 25000|
|       George|             29|                    4|                 20000|
|         Jack|             24|                    3|                 20000|
|        Jacob|             21|                    1|                 15000|
|          Leo|             23|                    2|                 18000|
|        Oscar|           null|                 null|                 40000|
|         null|             34|                   10|                 38000|
+-------------+---------------+---------------------+----------------------+



**Inference:** Here we can see that our last row has been dropped because it has only 1 non-null value (three of its four values are null), which is below our threshold of 2. The other rows with nulls still have at least 2 non-null values, so they are not dropped.
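To make the counting explicit, here is the threshold rule modelled in plain Python on the three rows of the dummy dataset that contain nulls. The helper `keep_row_thresh` is a hypothetical name used only for this sketch.

```python
# The three rows from the dummy dataset that contain nulls (None = NULL).
rows = [
    ("Oscar", None, None, 40000),  # 2 non-null values
    (None, 34, 10, 38000),         # 3 non-null values
    (None, 36, None, None),        # 1 non-null value
]


def keep_row_thresh(row, thresh):
    """A row is kept when it has at least `thresh` non-null values."""
    non_null = sum(value is not None for value in row)
    return non_null >= thresh


kept_with_2 = [row for row in rows if keep_row_thresh(row, 2)]
kept_with_3 = [row for row in rows if keep_row_thresh(row, 3)]

print(kept_with_2)
print(kept_with_3)
```

Raising the threshold to 3 would also remove Oscar's row, since it has only two non-null values.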

## "SUBSET" parameter in na.drop() function

This parameter will remind us of **pandas** (`dropna(subset=...)`): it restricts the null check to the **columns we list**. Rows are dropped based only on nulls in those columns, while all columns are still returned in the result.

In [14]:
df_null_pyspark.na.drop(how='any', subset=['Experience (in years)']).show()

+-------------+---------------+---------------------+----------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|
+-------------+---------------+---------------------+----------------------+
|       Oliver|             31|                   10|                 30000|
|        Harry|             30|                    8|                 25000|
|       George|             29|                    4|                 20000|
|         Jack|             24|                    3|                 20000|
|        Jacob|             21|                    1|                 15000|
|          Leo|             23|                    2|                 18000|
|         null|             34|                   10|                 38000|
+-------------+---------------+---------------------+----------------------+



**Inference:** Comparing with the original dataset, the rows that had a null value in the **"Experience (in years)"** column have been removed, while the row whose only null is in another column (Employee Name) is kept, because we used the **subset parameter**.

Similarly, if we want the same thing for multiple columns, we can simply add more column names to the list, each in quotes and separated by commas.

## Filling missing values

The **na.fill()** function is responsible for filling the **missing (NULL) values** in the dataset.

* The first parameter of this function is the **value** that needs to be **imputed** in place of the missing/null values.
* The second parameter is the **name of the column/columns** on which we want to perform this imputation. It is completely **optional**: if we leave it out, the imputation is performed on **every column whose data type matches the value**.

Let's see the live example of the same.

In [24]:
df_null_pyspark.na.fill('NA values', 'Employee Name').show()

+-------------+---------------+---------------------+----------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|
+-------------+---------------+---------------------+----------------------+
|       Oliver|             31|                   10|                 30000|
|        Harry|             30|                    8|                 25000|
|       George|             29|                    4|                 20000|
|         Jack|             24|                    3|                 20000|
|        Jacob|             21|                    1|                 15000|
|          Leo|             23|                    2|                 18000|
|        Oscar|           null|                 null|                 40000|
|    NA values|             34|                   10|                 38000|
|    NA values|             36|                 null|                  null|
+-------------+---------------+---------------------+----------------------+

**Inference:** In the above output one can clearly see that I have used **both options**, i.e. the value to impute and the specific column, and got the expected result: only the nulls in Employee Name were replaced.

**Note:** If we want to perform the above operation on **multiple columns**, we just need to pass the names of those columns as a **list.**

### Imputing NA values with measures of central tendency

This is a more professional way to handle missing values, i.e. imputing the null values with the mean/median/mode depending on the domain of the dataset. Here we will use the Imputer class from PySpark's ml.feature module to do so.

In [26]:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols = ['Age of Employee', 'Experience (in years)', 'Salary (per month - $)'],
    outputCols = ["{}_imputed".format(a) for a in ['Age of Employee', 'Experience (in years)', 'Salary (per month - $)']]
).setStrategy("mean")

**Code breakdown:** There is a lot going on here, so let's break it down.

* First we imported the **Imputer class** from **PySpark's ml.feature** module.
* Then, while creating the Imputer object, we defined our **input columns** as well as **output columns**: the input columns are the columns that need to be imputed, and the output columns will hold the imputed values.
* Finally we **set the strategy** for imputing values (here it's **mean**), but we could also use **median or mode** depending on the dataset.

### Fit and transform

So far we have only configured the Imputer object to impute mean values in place of nulls. To see the changes we need to call the **fit and transform methods** one after the other.

In [28]:
imputer.fit(df_null_pyspark).transform(df_null_pyspark).show()

+-------------+---------------+---------------------+----------------------+-----------------------+-----------------------------+------------------------------+
|Employee Name|Age of Employee|Experience (in years)|Salary (per month - $)|Age of Employee_imputed|Experience (in years)_imputed|Salary (per month - $)_imputed|
+-------------+---------------+---------------------+----------------------+-----------------------+-----------------------------+------------------------------+
|       Oliver|             31|                   10|                 30000|                     31|                           10|                         30000|
|        Harry|             30|                    8|                 25000|                     30|                            8|                         25000|
|       George|             29|                    4|                 20000|                     29|                            4|                         20000|
|         Jack|             

**Inference:** Here we can see that three more columns were added at the end with the suffix **"_imputed"**, and the null values in those columns are replaced with the **mean values**. For that we chained the **fit and transform methods**: fit computes the mean of each input column and transform adds the imputed columns to our DataFrame.
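As a sanity check, the means can be worked out by hand from the dummy dataset, skipping the nulls the same way the mean strategy does. This is plain Python arithmetic, not Spark output, and since the table above is cut off I am not claiming how Spark displays these values in the integer columns.

```python
from statistics import mean

# Columns copied from the dummy dataset; None marks a missing value.
age = [31, 30, 29, 24, 21, 23, None, 34, 36]
experience = [10, 8, 4, 3, 1, 2, None, 10, None]
salary = [30000, 25000, 20000, 20000, 15000, 18000, 40000, 38000, None]


def column_mean(values):
    """Mean of the values that are present; nulls are ignored."""
    present = [v for v in values if v is not None]
    return mean(present)


print(column_mean(age))  # 28.5
print(column_mean(experience))
print(column_mean(salary))
```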


**Note:** It's always a good practice to **drop the original columns** that are still holding the NULL values, as they can **hamper the data analysis and machine learning phase**.

## Takeaways from the article

1. First of all we did the mandatory steps which are required whenever we work with PySpark, i.e. starting the PySpark session and reading the dataset on which we perform the operations.

2. Then we learned how and when to drop the complete columns from the dataset and which functions are required to do so.

3. After knowing how to drop the columns we also came across how to drop the rows from the dataset depending on the business requirements.

4. Then we dived deep into the different parameters of the dropping functions and learned what each parameter contributes.

5. Finally we learned how to impute the values using either mean, mode or median, which is one of the standard ways to deal with missing values.