## Missing Data

Types of Missing Data

- Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any variable, observed or unobserved. The missingness is purely random.
  - A respondent simply forgot to fill in his weight.
  - Whether an IQ score is missing does not depend on age.
- Missing at Random (MAR): The probability that a value is missing depends on other observed variables, but not on the missing value itself (missingness of Y depends on X).
  - Women tend not to disclose their weight.
  - IQ scores are often missing for people under 31 years old.
- Missing Not at Random (MNAR): The probability that a value is missing depends on the value of that variable itself, even after controlling for other variables (missingness of Y depends on Y).
  - Heavier people tend not to report their weight.
  - People with low IQ scores tend not to answer the question.
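
To make the three mechanisms concrete, here is a small simulation in plain Python (no Spark needed). Everything in it is an illustrative assumption: the `simulate` and `gaps` helpers, the normal distribution for weight, and the missingness probabilities are made up for the sketch and do not come from real data.

```python
import random

def simulate(mechanism, n=20000, seed=0):
    """Return (sex, weight, missing) tuples for a hypothetical survey.

    `missing` says whether the weight answer would be withheld.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sex = rng.choice(["F", "M"])
        weight = rng.gauss(70, 15)  # assumed independent of sex
        if mechanism == "MCAR":
            p = 0.2                           # same chance for everyone
        elif mechanism == "MAR":
            p = 0.4 if sex == "F" else 0.1    # depends only on observed sex
        elif mechanism == "MNAR":
            p = 0.4 if weight > 70 else 0.1   # depends on the hidden value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        rows.append((sex, weight, rng.random() < p))
    return rows

def missing_rate(rows):
    return sum(missing for _, _, missing in rows) / len(rows)

def gaps(rows):
    """Difference in missing rate between groups: (by sex, by weight)."""
    by_sex = (missing_rate([r for r in rows if r[0] == "F"])
              - missing_rate([r for r in rows if r[0] == "M"]))
    by_weight = (missing_rate([r for r in rows if r[1] > 70])
                 - missing_rate([r for r in rows if r[1] <= 70]))
    return by_sex, by_weight

for mech in ("MCAR", "MAR", "MNAR"):
    by_sex, by_weight = gaps(simulate(mech))
    print(f"{mech}: gap by sex = {by_sex:+.2f}, gap by weight = {by_weight:+.2f}")
```

Under these assumptions both gaps should be close to zero for MCAR, only the gap by sex should be large for MAR, and only the gap by weight should be large for MNAR. Note that the gap by weight can be computed here only because the simulation knows the true weights; with real MNAR data those values are exactly the ones you cannot see.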

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('missing data').getOrCreate()

In [2]:
import requests

url = "https://raw.githubusercontent.com/oakabc/DEA/refs/heads/main/7%20-%20Missing%20Data%2C%20Dates%20and%20Timestamp/ContainsNull.csv"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed

with open("ContainsNull.csv", "wb") as file:
    file.write(response.content)

# Then read the local file
df = spark.read.csv("ContainsNull.csv", header=True, inferSchema=True)
df.show()


+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [3]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sales: double (nullable = true)



Drop rows that contain any NULL value

In [4]:
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



Keep only rows that have at least thresh non-NULL values; rows with fewer are dropped. If thresh is given, it overrides how.

In [5]:
df.na.drop(thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
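
The thresh rule can be modelled in a few lines of plain Python. This is only a sketch of the counting logic, not PySpark's implementation; `drop_by_thresh` is a hypothetical helper and None stands in for NULL.

```python
# The four rows of the example DataFrame, with None standing in for NULL.
rows = [
    ("emp1", "John", None),
    ("emp2", None, None),
    ("emp3", None, 345.0),
    ("emp4", "Cindy", 456.0),
]

def drop_by_thresh(rows, thresh):
    """Hypothetical helper: keep rows with at least `thresh` non-null values."""
    return [r for r in rows if sum(v is not None for v in r) >= thresh]

kept = drop_by_thresh(rows, thresh=2)
print([r[0] for r in kept])  # ['emp1', 'emp3', 'emp4']
```

emp2 has only one non-null value (its Id), so it is the only row that falls below the threshold of 2.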



**Using How (all)**

The how parameter of df.na.drop() controls when a row with missing (null) values is dropped. Note that df.na.drop() only ever removes rows, never columns.

By default, how='any' is used: a row is dropped if any of its values is null.

how='all' means that only rows where all values are null will be dropped.
If a row has at least one non-null value, it will be retained. In this DataFrame the Id column is never null, so how='all' drops nothing.

In [6]:
df.na.drop(how='all').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
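
As a rough illustration of the difference between how='any' and how='all', the sketch below applies both rules to the same four rows in plain Python. `drop_rows` is a hypothetical stand-in for the row filter, not the PySpark API.

```python
# The four rows of the example DataFrame, with None standing in for NULL.
rows = [
    ("emp1", "John", None),
    ("emp2", None, None),
    ("emp3", None, 345.0),
    ("emp4", "Cindy", 456.0),
]

def drop_rows(rows, how="any"):
    """Hypothetical model of the how= rule."""
    if how == "any":
        # Drop a row as soon as one value is null: keep only complete rows.
        return [r for r in rows if all(v is not None for v in r)]
    if how == "all":
        # Drop a row only if every value is null.
        return [r for r in rows if any(v is not None for v in r)]
    raise ValueError("how must be 'any' or 'all'")

print([r[0] for r in drop_rows(rows, how="any")])  # ['emp4']
print([r[0] for r in drop_rows(rows, how="all")])  # ['emp1', 'emp2', 'emp3', 'emp4']
```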



**Using subset**

The subset parameter in df.na.drop() specifies the columns that PySpark should check for null (or missing) values. Rows with null values in the specified subset of columns will be dropped.

This command checks for null values only in the Sales column.
If a row has a null value in the Sales column, it will be removed, even if other columns in the same row have valid (non-null) values.

In [7]:
df.na.drop(subset = ['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
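
The effect of subset, and how it combines with how, can be sketched in plain Python as well. Again this is a simplified model under stated assumptions: `COLUMNS` and `drop_rows` are hypothetical names, and only the null check is restricted to the chosen columns while whole rows are kept or dropped.

```python
COLUMNS = ("Id", "Name", "Sales")

# The four rows of the example DataFrame, with None standing in for NULL.
rows = [
    ("emp1", "John", None),
    ("emp2", None, None),
    ("emp3", None, 345.0),
    ("emp4", "Cindy", 456.0),
]

def drop_rows(rows, how="any", subset=None):
    """Hypothetical model: apply the how= rule to the subset columns only."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    idx = [COLUMNS.index(c) for c in (subset or COLUMNS)]
    is_dropped = any if how == "any" else all
    return [r for r in rows if not is_dropped(r[i] is None for i in idx)]

# Only Sales is checked, so emp3 survives despite its null Name.
print([r[0] for r in drop_rows(rows, subset=["Sales"])])  # ['emp3', 'emp4']

# With how='all', a row goes only if Name AND Sales are both null.
print([r[0] for r in drop_rows(rows, how="all", subset=["Name", "Sales"])])  # ['emp1', 'emp3', 'emp4']
```

The second call shows why subset is useful together with how='all': once the always-filled Id column is excluded from the check, emp2 counts as an "all null" row.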



**Fill in the Missing Values**

The call df.na.fill('yatta') fills missing (null) values only in string columns, because the replacement value ('yatta') is a string.
PySpark applies the value only to columns of a compatible type, so the nulls in the double column Sales are left unchanged.

In [9]:
df.na.fill('yatta').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2|yatta| NULL|
|emp3|yatta|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [10]:
df.na.fill(0).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|  0.0|
|emp2| NULL|  0.0|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
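
The type matching seen in the two fills above can be mimicked with a small plain-Python model. It is a deliberately simplified sketch: `SCHEMA` and `fill` are hypothetical names, and only string and numeric fill values are handled.

```python
# Assumed schema, mirroring df.printSchema(); None stands in for NULL.
SCHEMA = {"Id": "string", "Name": "string", "Sales": "double"}

rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]

def fill(rows, value, subset=None):
    """Hypothetical model: replace None only in columns whose type matches `value`."""
    if isinstance(value, str):
        wanted, fill_value = "string", value
    else:
        wanted, fill_value = "double", float(value)  # numeric values are cast to the column type
    targets = {c for c in (subset or SCHEMA) if SCHEMA[c] == wanted}
    return [
        {c: (fill_value if c in targets and v is None else v) for c, v in row.items()}
        for row in rows
    ]

print([r["Name"] for r in fill(rows, "yatta")])  # ['John', 'yatta', 'yatta', 'Cindy']
print([r["Sales"] for r in fill(rows, 0)])       # [0.0, 0.0, 345.0, 456.0]
```

A string value never touches Sales and a numeric value never touches Name, which matches the two outputs shown above.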



Fill in the missing values in the selected columns

In [12]:
df.na.fill('yatta', subset = ['Name']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2|yatta| NULL|
|emp3|yatta|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



**Tests**

In [13]:
df.na.fill('yatta', subset = ['Name']).show()
df.na.fill(69, subset = ['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2|yatta| NULL|
|emp3|yatta|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| 69.0|
|emp2| NULL| 69.0|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



Fill in the missing values with the column mean (mean imputation)

In [14]:
from pyspark.sql.functions import mean
mean_sales = df.select(mean(df['Sales'])).collect()
mean_sales

[Row(avg(Sales)=400.5)]

Extract the scalar value from the list of Row objects returned by collect()

In [15]:
mean_sales = mean_sales[0][0]
mean_sales

400.5

In [16]:
df.na.fill(mean_sales, ['Sales']).show()
# df.na.fill(df.select(mean(df['Sales'])).collect()[0][0], ['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| NULL|400.5|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
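
The value 400.5 follows from the fact that the mean aggregate ignores nulls: only the two observed Sales values enter the average. The plain-Python sketch below reproduces the arithmetic; the list `sales` simply copies the Sales column, with None standing in for NULL.

```python
from statistics import mean

sales = [None, None, 345.0, 456.0]

# Average only the observed values: (345.0 + 456.0) / 2
observed = [v for v in sales if v is not None]
mean_sales = mean(observed)

# Replace each missing entry with that mean.
imputed = [mean_sales if v is None else v for v in sales]

print(mean_sales)  # 400.5
print(imputed)     # [400.5, 400.5, 345.0, 456.0]
```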



In [17]:
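from pyspark.sql.functions import mean

# Fill missing names with a placeholder string
df.na.fill('son', subset = ['Name']).show()

# Compute the mean of Sales (nulls are ignored) and extract the scalar
mean_sales = df.select(mean(df['Sales'])).collect()[0][0]

# Fill missing Sales values with the mean
df.na.fill(mean_sales, ['Sales']).show()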

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2|  son| NULL|
|emp3|  son|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| NULL|400.5|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [18]:
spark.stop()