
In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Handling Nulls in the Pyspark").getOrCreate()

In [55]:
employees = [
    (1, "Alice", 34, "F", 50000.0),
    (2, "Bob", None, "M", 45000.0),
    (3, None, 29, "F", None),
    (4, "David", 45, None, 60000.0),
    (5, "Eva", None, "F", 52000.0),
    (6, "Frank", 38, "M", None),
    (7, None, None, None, None),
    (8, "Grace", 28, "F", 48000.0)
]

# Column names
columns = ["ID", "Name", "Age", "Gender", "Salary"]

## **Create DataFrame**

In [56]:
employees_df = spark.createDataFrame(employees,columns)
employees_df.show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  1|Alice|  34|     F|50000.0|
|  2|  Bob|NULL|     M|45000.0|
|  3| NULL|  29|     F|   NULL|
|  4|David|  45|  NULL|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  6|Frank|  38|     M|   NULL|
|  7| NULL|NULL|  NULL|   NULL|
|  8|Grace|  28|     F|48000.0|
+---+-----+----+------+-------+



## **1. Dropping Null rows**

### **1.1 Dropping the row if it contains any Null**

In [40]:
employees_df.na.drop().show()

+---+-----+---+------+-------+
| ID| Name|Age|Gender| Salary|
+---+-----+---+------+-------+
|  1|Alice| 34|     F|50000.0|
|  8|Grace| 28|     F|48000.0|
+---+-----+---+------+-------+



### **1.2 Dropping the row if a specific column contains Null**
#### **Single Column**

In [41]:
employees_df.na.drop(subset=["Name"]).show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  1|Alice|  34|     F|50000.0|
|  2|  Bob|NULL|     M|45000.0|
|  4|David|  45|  NULL|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  6|Frank|  38|     M|   NULL|
|  8|Grace|  28|     F|48000.0|
+---+-----+----+------+-------+



In [42]:
employees_df.na.drop(subset=["Salary"]).show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  1|Alice|  34|     F|50000.0|
|  2|  Bob|NULL|     M|45000.0|
|  4|David|  45|  NULL|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  8|Grace|  28|     F|48000.0|
+---+-----+----+------+-------+



#### **Multiple Columns**
##### **Delete the row if all selected columns are Null**

In [43]:
employees_df.na.drop(how='all', subset=["Name", "Age", "Salary"]).show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  1|Alice|  34|     F|50000.0|
|  2|  Bob|NULL|     M|45000.0|
|  3| NULL|  29|     F|   NULL|
|  4|David|  45|  NULL|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  6|Frank|  38|     M|   NULL|
|  8|Grace|  28|     F|48000.0|
+---+-----+----+------+-------+



##### **Delete the row if any of the selected columns is Null**

In [44]:
employees_df.na.drop(how='any', subset=["Name", "Age", "Salary"]).show()

+---+-----+---+------+-------+
| ID| Name|Age|Gender| Salary|
+---+-----+---+------+-------+
|  1|Alice| 34|     F|50000.0|
|  4|David| 45|  NULL|60000.0|
|  8|Grace| 28|     F|48000.0|
+---+-----+---+------+-------+



## **2. Fill Null values using na.fill() or fillna()**

### **2.1 Filling nulls with a specific value**

In [45]:
employees_df.na.fill(0).show()

+---+-----+---+------+-------+
| ID| Name|Age|Gender| Salary|
+---+-----+---+------+-------+
|  1|Alice| 34|     F|50000.0|
|  2|  Bob|  0|     M|45000.0|
|  3| NULL| 29|     F|    0.0|
|  4|David| 45|  NULL|60000.0|
|  5|  Eva|  0|     F|52000.0|
|  6|Frank| 38|     M|    0.0|
|  7| NULL|  0|  NULL|    0.0|
|  8|Grace| 28|     F|48000.0|
+---+-----+---+------+-------+



In [46]:
employees_df.na.fill('NA').show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  1|Alice|  34|     F|50000.0|
|  2|  Bob|NULL|     M|45000.0|
|  3|   NA|  29|     F|   NULL|
|  4|David|  45|    NA|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  6|Frank|  38|     M|   NULL|
|  7|   NA|NULL|    NA|   NULL|
|  8|Grace|  28|     F|48000.0|
+---+-----+----+------+-------+



### **2.2 Filling nulls per column using a dictionary**

In [47]:
employees_df.na.fill({'Name':'Unknown','Age':0,'Gender':'NA','Salary':0}).show()

+---+-------+---+------+-------+
| ID|   Name|Age|Gender| Salary|
+---+-------+---+------+-------+
|  1|  Alice| 34|     F|50000.0|
|  2|    Bob|  0|     M|45000.0|
|  3|Unknown| 29|     F|    0.0|
|  4|  David| 45|    NA|60000.0|
|  5|    Eva|  0|     F|52000.0|
|  6|  Frank| 38|     M|    0.0|
|  7|Unknown|  0|    NA|    0.0|
|  8|  Grace| 28|     F|48000.0|
+---+-------+---+------+-------+



In [48]:
employees_df.fillna({'Name':'Unknown','Age':0,'Gender':'NA','Salary':0}).show()

+---+-------+---+------+-------+
| ID|   Name|Age|Gender| Salary|
+---+-------+---+------+-------+
|  1|  Alice| 34|     F|50000.0|
|  2|    Bob|  0|     M|45000.0|
|  3|Unknown| 29|     F|    0.0|
|  4|  David| 45|    NA|60000.0|
|  5|    Eva|  0|     F|52000.0|
|  6|  Frank| 38|     M|    0.0|
|  7|Unknown|  0|    NA|    0.0|
|  8|  Grace| 28|     F|48000.0|
+---+-------+---+------+-------+



## **3. Removing Null rows using filter()**

### **3.1 Getting the rows where the selected columns are not null**

In [49]:
employees_df.filter(employees_df.Salary.isNotNull()).show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  1|Alice|  34|     F|50000.0|
|  2|  Bob|NULL|     M|45000.0|
|  4|David|  45|  NULL|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  8|Grace|  28|     F|48000.0|
+---+-----+----+------+-------+



In [50]:
employees_df.filter(employees_df.Name.isNotNull() & employees_df.Salary.isNotNull()).show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  1|Alice|  34|     F|50000.0|
|  2|  Bob|NULL|     M|45000.0|
|  4|David|  45|  NULL|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  8|Grace|  28|     F|48000.0|
+---+-----+----+------+-------+



### **3.2 Getting the rows that contain nulls**

In [51]:
employees_df.filter(employees_df.Name.isNull() | employees_df.Salary.isNull() | employees_df.Gender.isNull() | employees_df.Age.isNull()).show()

+---+-----+----+------+-------+
| ID| Name| Age|Gender| Salary|
+---+-----+----+------+-------+
|  2|  Bob|NULL|     M|45000.0|
|  3| NULL|  29|     F|   NULL|
|  4|David|  45|  NULL|60000.0|
|  5|  Eva|NULL|     F|52000.0|
|  6|Frank|  38|     M|   NULL|
|  7| NULL|NULL|  NULL|   NULL|
+---+-----+----+------+-------+



## **4. Fill nulls using when()**

In [52]:
from pyspark.sql.functions import when, col

#### **Filling null names with Unknown**

In [57]:
employees_df = employees_df.withColumn("Name",when(col("Name").isNull(),"Unknown").otherwise(col("Name")))
employees_df.select("Name").show()

+-------+
|   Name|
+-------+
|  Alice|
|    Bob|
|Unknown|
|  David|
|    Eva|
|  Frank|
|Unknown|
|  Grace|
+-------+



In [58]:
employees_df = employees_df.withColumn("Salary",when(col("Salary").isNull(),0).otherwise(col("Salary")))
employees_df.select("Salary").show()

+-------+
| Salary|
+-------+
|50000.0|
|45000.0|
|    0.0|
|60000.0|
|52000.0|
|    0.0|
|    0.0|
|48000.0|
+-------+

