### FILLNA()/NA.FILL()
When working with Spark DataFrames, many common operations can produce null values in some of the records. Later operations may fail or give wrong results when they 
encounter null or empty values, so these values have to be replaced before the DataFrame can be processed further.

Replacing null values is one of the most common operations on PySpark DataFrames. It can be done with either DataFrame.fillna() or DataFrame.na.fill(); the two methods 
are aliases of each other.

### DROPNA()
A large DataFrame with many rows and columns often contains NULL (None) values in some rows or columns, and some rows may be entirely NULL. Operations applied to such a 
DataFrame may not give the correct or desired output. To get correct results, the DataFrame has to be cleaned first, that is, the rows containing NULL values have to be 
removed.

In [0]:
# Creating dataframe

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

details = [
    ("Virat Kholi", 97, 84, 77, 161),
    ("Sachin Tendulkar", 65, None, 65, 65),
    ("Ms Dhoni", 60, 84, None, 84),
    ("Rishab Pant", 88, 90, 74, 164),
    ("Suresh Raina", 70, 50, 70, 120),
    ("Ravindra Jadeja", 50, None, None, 0),
]
columns = StructType(
    [
        StructField(name="Name", dataType=StringType()),
        StructField(name="Attendence", dataType=IntegerType()),
        StructField(name="Subject_1", dataType=StringType()),
        StructField(name="Subject_2", dataType=StringType()),
        StructField(name="Total", dataType=StringType()),
    ]
)

In [0]:
df = spark.createDataFrame(details, columns)

In [0]:
df.printSchema()


root
 |-- Name: string (nullable = true)
 |-- Attendence: integer (nullable = true)
 |-- Subject_1: string (nullable = true)
 |-- Subject_2: string (nullable = true)
 |-- Total: string (nullable = true)



In [0]:
df.display()

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84.0,77.0,161
Sachin Tendulkar,65,,65.0,65
Ms Dhoni,60,84.0,,84
Rishab Pant,88,90.0,74.0,164
Suresh Raina,70,50.0,70.0,120
Ravindra Jadeja,50,,,0


#### *Fillna()*
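It replaces NULL (None) values in all columns, or in a selected subset of columns, with a constant literal such as zero (0), an empty string, or a space. Only columns whose data type matches the type of the replacement value are filled; for example, a string value is applied to string columns only.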
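The column-selection rule can be sketched in plain Python. This is only a simplified model of the behaviour, not Spark's implementation: a fill value is written into a column only when its type is compatible with the column's declared type, and only within the optional subset. The names `SCHEMA`, `ROWS`, `compatible`, and `fill_nulls` are hypothetical, and the third row is made up so that an integer column also contains a null.

```python
# Plain-Python model of the fillna() column-selection rule.
# Illustration only; this is not how Spark implements it.

# Hypothetical schema and rows, loosely mirroring the DataFrame above.
SCHEMA = {"Name": "string", "Attendence": "integer", "Subject_1": "string"}
ROWS = [
    {"Name": "Virat Kholi", "Attendence": 97, "Subject_1": "84"},
    {"Name": "Sachin Tendulkar", "Attendence": 65, "Subject_1": None},
    {"Name": "Example Player", "Attendence": None, "Subject_1": None},  # made-up row
]


def compatible(value, column_type):
    """Return True if the fill value may be written into a column of this type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return column_type == "boolean"
    if isinstance(value, (int, float)):
        return column_type in {"integer", "long", "float", "double"}
    if isinstance(value, str):
        return column_type == "string"
    return False


def fill_nulls(rows, schema, value, subset=None):
    """Replace None with value in type-compatible columns, optionally limited to subset."""
    targets = [c for c in (subset or schema) if compatible(value, schema[c])]
    return [
        {c: (value if c in targets and v is None else v) for c, v in row.items()}
        for row in rows
    ]


filled_str = fill_nulls(ROWS, SCHEMA, "Absent")  # touches string columns only
filled_int = fill_nulls(ROWS, SCHEMA, 0)         # touches numeric columns only

print(filled_str[2])  # Attendence stays None, Subject_1 becomes "Absent"
print(filled_int[2])  # Attendence becomes 0, Subject_1 stays None
```

Under this model, a string such as "Absent" leaves the integer column Attendence untouched, which is why the marks columns in this notebook are declared as strings before being filled with text.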

In [0]:
# Replace NULL values with an empty string (only string columns are affected)
df.fillna("").display()


Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84.0,77.0,161
Sachin Tendulkar,65,,65.0,65
Ms Dhoni,60,84.0,,84
Rishab Pant,88,90.0,74.0,164
Suresh Raina,70,50.0,70.0,120
Ravindra Jadeja,50,,,0


In [0]:
# Replace NULL values with a string.
df1 = df.fillna("Absent")
df1.display()

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84,77,161
Sachin Tendulkar,65,Absent,65,65
Ms Dhoni,60,84,Absent,84
Rishab Pant,88,90,74,164
Suresh Raina,70,50,70,120
Ravindra Jadeja,50,Absent,Absent,0


In [0]:
# Replace NULL values in a specific column. More than one column can be listed.

df.na.fill("Absent", ["Subject_1"]).display()
df.fillna("N/A", ["Subject_1", "Subject_2"]).display()

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84,77.0,161
Sachin Tendulkar,65,Absent,65.0,65
Ms Dhoni,60,84,,84
Rishab Pant,88,90,74.0,164
Suresh Raina,70,50,70.0,120
Ravindra Jadeja,50,Absent,,0


Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84,77,161
Sachin Tendulkar,65,N/A,65,65
Ms Dhoni,60,84,N/A,84
Rishab Pant,88,90,74,164
Suresh Raina,70,50,70,120
Ravindra Jadeja,50,N/A,N/A,0


In [0]:
# The original DataFrame is unchanged: fillna() returns a new DataFrame.
df.display()

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84.0,77.0,161
Sachin Tendulkar,65,,65.0,65
Ms Dhoni,60,84.0,,84
Rishab Pant,88,90.0,74.0,164
Suresh Raina,70,50.0,70.0,120
Ravindra Jadeja,50,,,0


#### *Dropna()*
The dropna() function is used to clean the DataFrame. It drops the rows that contain NULL values, and its parameters (how, thresh, subset) control which rows are dropped.

In [0]:
# Remove a row if it contains at least one NULL value.

df.dropna(how="any").display()

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84,77,161
Rishab Pant,88,90,74,164
Suresh Raina,70,50,70,120


In [0]:
# Remove a row only if all of its values are NULL. Here no row has all NULL values, so no row is affected.
df.dropna(how='all').display()

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84.0,77.0,161
Sachin Tendulkar,65,,65.0,65
Ms Dhoni,60,84.0,,84
Rishab Pant,88,90.0,74.0,164
Suresh Raina,70,50.0,70.0,120
Ravindra Jadeja,50,,,0


In [0]:
# Remove the rows that have a NULL value in the listed column. More than one column name can be given.
df.dropna(how="any", subset=["Subject_1"]).display()
df.dropna(how="any", subset=["Subject_1", "Subject_2"]).display()

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84,77.0,161
Ms Dhoni,60,84,,84
Rishab Pant,88,90,74.0,164
Suresh Raina,70,50,70.0,120


Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84,77,161
Rishab Pant,88,90,74,164
Suresh Raina,70,50,70,120


#### *thresh => Keeps a row only if it has at least thresh non-null values; otherwise the row is removed. When thresh is set, how is ignored.*
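The row rule can be modelled in plain Python as a quick check against the sample data. This is a sketch of the documented behaviour, not Spark's implementation: a row is kept when its count of non-null values reaches thresh, and how is consulted only when thresh is not given. The helper names `keep_row` and `kept_names` are hypothetical.

```python
# Plain-Python model of the dropna() row rule.
# Illustration only; this is not how Spark implements it.

COLUMNS = ["Name", "Attendence", "Subject_1", "Subject_2", "Total"]
DETAILS = [
    ("Virat Kholi", 97, 84, 77, 161),
    ("Sachin Tendulkar", 65, None, 65, 65),
    ("Ms Dhoni", 60, 84, None, 84),
    ("Rishab Pant", 88, 90, 74, 164),
    ("Suresh Raina", 70, 50, 70, 120),
    ("Ravindra Jadeja", 50, None, None, 0),
]


def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if the row survives under the modelled dropna() rule."""
    cols = subset or COLUMNS
    # Count the non-null values among the checked columns (0 is not null).
    non_null = sum(row[COLUMNS.index(c)] is not None for c in cols)
    if thresh is not None:  # thresh takes precedence over how
        return non_null >= thresh
    if how == "any":        # drop when any checked value is null
        return non_null == len(cols)
    if how == "all":        # drop only when every checked value is null
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


def kept_names(**kwargs):
    """Names of the rows that remain after applying the rule."""
    return [row[0] for row in DETAILS if keep_row(row, **kwargs)]


print(kept_names(thresh=3))  # all six rows: each has at least 3 non-null values
print(kept_names(thresh=4))  # Ravindra Jadeja is dropped (only 3 non-null values)
```

Ravindra Jadeja has exactly three non-null values (Name, Attendence, Total), so the row survives thresh=3 but not thresh=4.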

In [0]:
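# Using thresh. When thresh is set it overrides how, so how is omitted here.
df.dropna(thresh=3).display()  # keep rows with at least 3 non-null values
df.dropna(thresh=4).display()  # keep rows with at least 4 non-null values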

Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84.0,77.0,161
Sachin Tendulkar,65,,65.0,65
Ms Dhoni,60,84.0,,84
Rishab Pant,88,90.0,74.0,164
Suresh Raina,70,50.0,70.0,120
Ravindra Jadeja,50,,,0


Name,Attendence,Subject_1,Subject_2,Total
Virat Kholi,97,84.0,77.0,161
Sachin Tendulkar,65,,65.0,65
Ms Dhoni,60,84.0,,84
Rishab Pant,88,90.0,74.0,164
Suresh Raina,70,50.0,70.0,120
