A third-party tool loads a bronze table into our data lake. We have been asked to clean up the data and resolve its known issues. Your task is to write Python code that addresses each of the following issues.

The following are the issues present:

Wrong column name: the date column is misspelled as dat.
Nulls not correctly identified: the sales_id column stores nulls as the string "NA".
Rows with missing values are unwanted: drop any row with a null sales_id.
Duplicate sales_id: keep the first row for each duplicated sales_id.
Date column not DateType: the date column is a string and must be converted to DateType.


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, when

In [3]:
spark = SparkSession.builder.appName('chap-2').master("local[*]").getOrCreate()


In [4]:
# Note: the clerk_id column name has a leading space (" clerk_id").
bronze_sales = spark.createDataFrame(
    data=[
        ("1", "LA", "2000-01-01", 5, 1400),
        ("2", "LA", "1998-2-01", 4, 1500),
        ("2", "LA", "1998-2-01", 4, 1500),
        ("3", "LA", "1997-4-01", 6, 1300),
        ("4", "LA", "2005-5-01", 2, 1100),
        ("NA", "LA", "2013-6-01", 1, 1200),
    ],
    schema=["sales_id", "city", "dat", " clerk_id", "total_sales"],
)

In [40]:
# Rename the misspelled column: add a copy of "dat" named "date", then drop "dat".
# The renamed column ends up as the last column of the new DataFrame.

one = bronze_sales.select(col("*"), col("dat").alias("date")).drop("dat")

In [27]:
# Replace the "NA" placeholder with a real null; the cleaned column ends up last.
two = (
    bronze_sales
    .select(col("*"), when(col("sales_id") == "NA", None).otherwise(col("sales_id")).alias("cleaned_sales_id"))
    .drop("sales_id")
    .withColumnRenamed("cleaned_sales_id", "sales_id")
)

In [28]:
two.show()

+----+----------+---------+-----------+--------+
|city|       dat| clerk_id|total_sales|sales_id|
+----+----------+---------+-----------+--------+
|  LA|2000-01-01|        5|       1400|       1|
|  LA| 1998-2-01|        4|       1500|       2|
|  LA| 1998-2-01|        4|       1500|       2|
|  LA| 1997-4-01|        6|       1300|       3|
|  LA| 2005-5-01|        2|       1100|       4|
|  LA| 2013-6-01|        1|       1200|    NULL|
+----+----------+---------+-----------+--------+



In [29]:
three = two.na.drop(subset=["sales_id"])
three.show()

+----+----------+---------+-----------+--------+
|city|       dat| clerk_id|total_sales|sales_id|
+----+----------+---------+-----------+--------+
|  LA|2000-01-01|        5|       1400|       1|
|  LA| 1998-2-01|        4|       1500|       2|
|  LA| 1998-2-01|        4|       1500|       2|
|  LA| 1997-4-01|        6|       1300|       3|
|  LA| 2005-5-01|        2|       1100|       4|
+----+----------+---------+-----------+--------+



In [30]:
four = three.dropDuplicates(subset=["sales_id"])

In [31]:
four.show()

+----+----------+---------+-----------+--------+
|city|       dat| clerk_id|total_sales|sales_id|
+----+----------+---------+-----------+--------+
|  LA|2000-01-01|        5|       1400|       1|
|  LA| 1998-2-01|        4|       1500|       2|
|  LA| 1997-4-01|        6|       1300|       3|
|  LA| 2005-5-01|        2|       1100|       4|
+----+----------+---------+-----------+--------+



In [41]:
from pyspark.sql.functions import to_date

# Cast the string column to DateType. Keep show() out of the assignment:
# show() returns None, so chaining it would leave five set to None.
# This select keeps only the date column.
five = one.select(to_date(col("date")).alias("date"))
five.show()

+----------+
|      date|
+----------+
|2000-01-01|
|1998-02-01|
|1998-02-01|
|1997-04-01|
|2005-05-01|
|2013-06-01|
+----------+


