###Missing Data
Data sources are often incomplete, so you will have missing data. You have three basic options for handling it (you will have to decide which approach is right for your data):

* Just keep the missing data points.
* Drop the missing data points (including the entire row).
* Fill them in with some other value.

####Keeping the missing data
A few machine learning algorithms can easily deal with missing data. Let's see what missing data looks like in a DataFrame:

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("MissingData").getOrCreate()

In [4]:
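# raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
df = spark.read.csv(r"G:\Downloads Ex\Python-and-Spark-for-Big-Data-master\Python-and-Spark-for-Big-Data-master\Spark_DataFrames\ContainsNull.csv", inferSchema=True, header=True)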

In [5]:
df.show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
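
Before choosing between keeping, dropping, and filling, it helps to know how many values are missing in each column. The sketch below is a plain-Python illustration that runs without a Spark session: the `rows` list is a hand-written copy of the table above, with `None` standing in for NULL, and `null_counts` is a hypothetical helper, not part of PySpark.

```python
# Hand-written copy of the table above; None stands in for NULL.
rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]


def null_counts(rows):
    """Return {column: number of None values} for a list of dict rows."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts


print(null_counts(rows))  # {'Id': 0, 'Name': 2, 'Sales': 2}
```

Here `Name` and `Sales` each have two missing values out of four, so dropping every incomplete row would discard most of the data.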



####Dropping the missing data
DataFrames expose helpers for missing data through the `.na` attribute. The `drop` method has the following parameters:

df.na.drop(how='any', thresh=None, subset=None)

* param how: 'any' or 'all'.
    If 'any', drop a row if it contains any nulls.
    If 'all', drop a row only if all its values are null.

* param thresh: int, default None
    If specified, drop rows that have less than `thresh` non-null values.
    This overwrites the `how` parameter.
    
* param subset: 
    optional list of column names to consider.
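
The parameter rules above can be modelled in plain Python, which makes them easy to check without a Spark session. This is an illustrative sketch, not Spark's implementation: `drop_rows`, `ids`, and the `rows` list are hypothetical names, and `None` stands in for NULL. It assumes that `how` translates into a minimum count of non-null values ('any' requires every considered column to be non-null, 'all' requires at least one) and that `thresh`, when given, replaces that minimum.

```python
# Hand-written copy of the sample table; None stands in for NULL.
rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]


def drop_rows(rows, how="any", thresh=None, subset=None):
    """Plain-Python model of the drop rule (an assumption-based sketch)."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    kept = []
    for row in rows:
        # Only the columns in subset are inspected; default is every column.
        columns = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in columns)
        if thresh is not None:
            minimum = thresh          # thresh takes precedence over how
        elif how == "any":
            minimum = len(columns)    # any null among the columns drops the row
        else:
            minimum = 1               # drop only if every column is null
        if non_null >= minimum:
            kept.append(row)
    return kept


def ids(result):
    """Return just the Id values, to keep the printed output short."""
    return [row["Id"] for row in result]


print(ids(drop_rows(rows)))                    # ['emp4']
print(ids(drop_rows(rows, thresh=2)))          # ['emp1', 'emp3', 'emp4']
print(ids(drop_rows(rows, how="all")))         # ['emp1', 'emp2', 'emp3', 'emp4']
print(ids(drop_rows(rows, subset=["Sales"])))  # ['emp3', 'emp4']
```

The four calls mirror the four `df.na.drop` calls in the cells that follow, so the printed Ids can be compared against the Spark output.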

In [7]:
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



In [8]:
# thresh=2 keeps every row that has at least 2 non-null values, so rows with a single null survive
df.na.drop(thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [10]:
# how='all' drops a row only if every value in it is null; no row here is entirely null, so nothing is dropped
df.na.drop(how='all').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [12]:
# subset restricts the null check to the listed columns: here, drop only the rows where Sales is null
df.na.drop(subset=['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



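####Filling in the missing data
Rather than dropping rows, you can fill the null values with `df.na.fill`.

In [13]:
# fill null names with a placeholder string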
df.na.fill('Unnamed', subset=['Name']).show()

+----+-------+-----+
|  Id|   Name|Sales|
+----+-------+-----+
|emp1|   John| NULL|
|emp2|Unnamed| NULL|
|emp3|Unnamed|345.0|
|emp4|  Cindy|456.0|
+----+-------+-----+
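
One detail of `fill` is worth spelling out: a fill value is only applied to columns whose type matches it, so a string value leaves the numeric `Sales` column untouched even when no `subset` is given. The sketch below models that rule in plain Python. It is an illustration under stated assumptions, not Spark code: `fill_rows`, `SCHEMA`, and `rows` are hypothetical names, `None` stands in for NULL, and only string and numeric fill values are modelled.

```python
# Assumed column types for the sample table (Spark would infer these).
SCHEMA = {"Id": str, "Name": str, "Sales": float}

# Hand-written copy of the sample table; None stands in for NULL.
rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]


def fill_rows(rows, value, subset=None, schema=SCHEMA):
    """Replace None with value, but only in columns whose type matches value."""
    columns = subset if subset is not None else list(schema)
    if isinstance(value, str):
        targets = {c for c in columns if schema[c] is str}
    else:
        targets = {c for c in columns if schema[c] in (int, float)}
    return [
        {c: (value if v is None and c in targets else v) for c, v in row.items()}
        for row in rows
    ]


named = fill_rows(rows, "Unnamed", subset=["Name"])
print([r["Name"] for r in named])   # ['John', 'Unnamed', 'Unnamed', 'Cindy']
print([r["Sales"] for r in named])  # [None, None, 345.0, 456.0]

# Without a subset, a string value still skips the numeric Sales column.
everywhere = fill_rows(rows, "Unnamed")
print([r["Sales"] for r in everywhere])  # [None, None, 345.0, 456.0]
```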



In [14]:
# to fill numeric nulls with the column mean, import the mean function
from pyspark.sql.functions import mean

In [19]:
# collect() returns a list of Row objects; here it holds a single Row containing the mean of Sales
mean_value = df.select(mean(df['Sales'])).collect()

In [24]:
# first Row, first field: the mean as a plain Python float
mean_sales = mean_value[0][0]

In [25]:
# now fill the null values in Sales with the mean
df.na.fill(mean_sales, subset=['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| NULL|400.5|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
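
The fill value of 400.5 comes from averaging only the known sales figures, since the mean skips null values rather than counting them as zero. The arithmetic can be checked in plain Python. This is a sketch, with `None` standing in for NULL and the `sales` list copied by hand from the table.

```python
# Sales column from the sample table; None stands in for NULL.
sales = [None, None, 345.0, 456.0]

# Average only the values that are present.
known = [s for s in sales if s is not None]
mean_sales = sum(known) / len(known)

# Replace each missing value with that mean.
filled = [mean_sales if s is None else s for s in sales]

print(mean_sales)  # 400.5
print(filled)      # [400.5, 400.5, 345.0, 456.0]
```

Had the nulls been counted as zeros, the mean would have been 200.25 instead, which is why it matters that they are skipped.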



In [26]:
# the same steps in one line; this is more compact, not faster, because it runs the same aggregation
df.na.fill(df.select(mean(df['Sales'])).collect()[0][0], subset=['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| NULL|400.5|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



Dates and timestamps are covered in the continuation file of Pyspark_OPerations.