# **Missing_Data**

Data sources are often incomplete, which means you will run into missing data. You have 3 basic options for handling it (you will personally have to decide which approach is right for your data):

- Just keep the missing data points.
- Drop the missing data points (including the entire row).
- Fill them in with some other value.

Let's cover examples of each of these methods!

**Keeping the missing data**

A few machine learning algorithms can easily deal with missing data. Let's load a small dataset and see what the missing values look like:

In [None]:
pip install pyspark

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('missing_data').getOrCreate()

In [3]:
df = spark.read.csv('/content/drive/MyDrive/Spark_DataFrames/ContainsNull.csv',inferSchema=True,header=True)

In [4]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sales: double (nullable = true)



In [8]:
df.show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



**Drop the missing data**

In [14]:
# df.na.drop(how='any', thresh=None, subset=None)

#    If 'any', drop a row if it contains any nulls.
#    If 'all', drop a row only if all its values are null.

# show() prints the table itself and returns None, so no print() is needed
df.na.drop(how='any').show()
df.na.drop(how='all').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+


In [12]:
# Keep only rows that have at least 2 non-null values (rows with fewer are dropped)

df.na.drop(thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [13]:
# Drop rows that have a null in any of the listed columns

df.na.drop(subset = ['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



**Filling the missing values**

We can also fill the missing values with new values. If you have nulls spread across columns of different data types, Spark only fills the columns whose data type matches the fill value and leaves the others untouched. For example:

In [15]:
# only fills where the data type is string

df.na.fill('New_Value').show()

+----+---------+-----+
|  Id|     Name|Sales|
+----+---------+-----+
|emp1|     John| null|
|emp2|New_Value| null|
|emp3|New_Value|345.0|
|emp4|    Cindy|456.0|
+----+---------+-----+



In [16]:
# only fills where the data type is numeric

df.na.fill(0).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|  0.0|
|emp2| null|  0.0|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [17]:
# specifying where to fill

df.na.fill('No name',subset=['Name']).show()

+----+-------+-----+
|  Id|   Name|Sales|
+----+-------+-----+
|emp1|   John| null|
|emp2|No name| null|
|emp3|No name|345.0|
|emp4|  Cindy|456.0|
+----+-------+-----+



Using functions to fill:

In [18]:
from pyspark.sql.functions import mean

In [25]:
df.select(mean('Sales')).collect()[0][0]

400.5

In [26]:
df.na.fill(df.select(mean('Sales')).collect()[0][0],subset=['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+

