# IBM - PySpark Interview Question

You are working as a Data Engineer for a company. The sales team has provided you with a dataset containing sales information. However, the data has some missing values that need to be addressed before processing. You are required to perform the following tasks:
1. Load the sample dataset (the `data` list defined in the code below) into a PySpark DataFrame.

2. Perform the following operations:
* Replace all NULL values in the Quantity column with 0.
* Replace all NULL values in the Price column with the average price of the existing data.
* Drop rows where the Product column is NULL.
* Fill missing Sales_Date with a default value of '2025-01-01'.
* Drop rows where all columns are NULL.


In [0]:
# Import only the functions used below; a wildcard import would shadow
# Python built-ins such as sum, min, max, and round.
from pyspark.sql.functions import isnull, mean

In [0]:
# Sample sales records; None marks a missing value.
data = [
    (1, "Laptop", 10, 50000, "North", "2025-01-01"),
    (2, "Mobile", None, 15000, "South", None),
    (3, "Tablet", 20, None, "West", "2025-01-03"),
    (4, "Desktop", 15, 30000, None, "2025-01-04"),
    (5, None, None, None, "East", "2025-01-05"),
]

columns = ["Sales_ID", "Product", "Quantity", "Price", "Region", "Sales_Date"]

In [0]:
df = spark.createDataFrame(data, columns)
df.display()

Sales_ID,Product,Quantity,Price,Region,Sales_Date
1,Laptop,10,50000,North,2025-01-01
2,Mobile,,15000,South,
3,Tablet,20,,West,2025-01-03
4,Desktop,15,30000,,2025-01-04
5,,,,East,2025-01-05


### Replace all NULL values in the Quantity column with 0.

In [0]:
df = df.fillna({'Quantity': 0})


### Replace all NULL values in the Price column with the average price of the existing data.

In [0]:
avg_data = df.agg(mean('Price')).collect()[0][0]
avg_data

Out[26]: 31666.666666666668
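
`mean` skips NULLs, so the result is the average of the three prices that are present, not of all five rows. A quick plain-Python check of that arithmetic, with the values copied from the sample data (the variable names are illustrative):

```python
# Price column from the sample data; None marks the two missing prices.
prices = [50000, 15000, None, 30000, None]

# Average over the non-missing values only, mirroring how mean() treats NULLs.
present = [p for p in prices if p is not None]
avg_price = sum(present) / len(present)
print(avg_price)
```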

In [0]:
df = df.fillna({'Price': avg_data})

### Drop rows where the Product column is NULL.

In [0]:
df_not_null_products = df.filter(~isnull('Product'))
df_not_null_products.display()

Sales_ID,Product,Quantity,Price,Region,Sales_Date
1,Laptop,10,50000,North,2025-01-01
2,Mobile,0,15000,South,
3,Tablet,20,31666,West,2025-01-03
4,Desktop,15,30000,,2025-01-04


### Fill missing Sales_Date with a default value of '2025-01-01'.

In [0]:
df_not_null_products = df_not_null_products.fillna({'Sales_Date': '2025-01-01'})
df_not_null_products.display()

Sales_ID,Product,Quantity,Price,Region,Sales_Date
1,Laptop,10,50000,North,2025-01-01
2,Mobile,0,15000,South,2025-01-01
3,Tablet,20,31666,West,2025-01-03
4,Desktop,15,30000,,2025-01-04


### Drop rows where all columns are NULL.

In [0]:
df_not_null_products = df_not_null_products.na.drop(how='all')
df_not_null_products.display()

Sales_ID,Product,Quantity,Price,Region,Sales_Date
1,Laptop,10,50000,North,2025-01-01
2,Mobile,0,15000,South,2025-01-01
3,Tablet,20,31666,West,2025-01-03
4,Desktop,15,30000,,2025-01-04
