Connecting to the bronze container to retrieve the raw data

In [0]:
dbutils.fs.ls("/mnt/bronze")

[FileInfo(path='dbfs:/mnt/bronze/customer.csv', name='customer.csv', size=65579, modificationTime=1714665014000),
 FileInfo(path='dbfs:/mnt/bronze/products.csv', name='products.csv', size=144540, modificationTime=1714665027000),
 FileInfo(path='dbfs:/mnt/bronze/stores.csv', name='stores.csv', size=904638, modificationTime=1714665041000)]

Checking for existing data in the silver container

In [0]:
dbutils.fs.ls("/mnt/silver")

[FileInfo(path='dbfs:/mnt/silver/df_customer/', name='df_customer/', size=0, modificationTime=1714685212000),
 FileInfo(path='dbfs:/mnt/silver/df_product/', name='df_product/', size=0, modificationTime=1714685219000),
 FileInfo(path='dbfs:/mnt/silver/df_store/', name='df_store/', size=0, modificationTime=1714685221000)]

Assigning paths for the customer, product, and store tables

In [0]:
customers_path = '/mnt/bronze/customer.csv'
products_path = '/mnt/bronze/products.csv'
store_path = '/mnt/bronze/stores.csv'

Loading the CSV files into DataFrames

In [0]:
df_customer = spark.read.format('csv').option('header', 'true').load(customers_path)
df_product = spark.read.format('csv').option('header', 'true').load(products_path)
df_store = spark.read.format('csv').option('header', 'true').load(store_path)
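
Because no schema is supplied, every column is loaded as a string, which is why casts are needed later in the notebook. As a hedged alternative, a schema can be handed to the reader as a DDL string. The sketch below is not part of the original notebook: `CUSTOMER_COLUMNS`, `to_ddl`, and `read_csv_with_schema` are hypothetical names, and the column types are assumptions inferred from the sample rows.

```python
# Hypothetical column-to-type mapping for customer.csv, inferred from the sample rows.
# Postal_Code stays a string so that any leading zeros survive.
CUSTOMER_COLUMNS = {
    "Customer_ID": "STRING",
    "Customer_Name": "STRING",
    "Segment": "STRING",
    "Country": "STRING",
    "City": "STRING",
    "State": "STRING",
    "Postal_Code": "STRING",
    "Region": "STRING",
    "Age": "INT",
}


def to_ddl(columns):
    """Render a {column: type} mapping as a Spark DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def read_csv_with_schema(spark, path, columns):
    """Read a CSV that has a header row, using an explicit schema."""
    return spark.read.csv(path, header=True, schema=to_ddl(columns))


print(to_ddl(CUSTOMER_COLUMNS))
```

In the notebook this could be called as `read_csv_with_schema(spark, customers_path, CUSTOMER_COLUMNS)`, in which case Age would already arrive as an integer.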

Understanding data in each table

In [0]:
display(df_customer.head(5))

Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Age
CG/12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,42
DV/13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036,West,47
SO/20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,19
BH/11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,39
AA/10480,Andrew Allen,Consumer,United States,Concord,North Carolina,28027,South,31


In [0]:
df_customer.count()

793

Checking summary statistics of the Age column in the customer table

In [0]:
df_customer.select("Age").describe().show()

+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|               793|
|   mean|33.746532156368225|
| stddev| 8.628123278990762|
|    min|                19|
|    max|                48|
+-------+------------------+



In [0]:
display(df_product.head(5))

Product_ID,Category,Sub_Category,Product_Name
FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase
FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back"
OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters by Universal
FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table
OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System


In [0]:
df_product.count()

1861

In [0]:
display(df_store.head(5))

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Product_ID,Sales,Discount
1,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG/12520,FUR-BO-10001798,3929400.0,0.02
2,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG/12520,FUR-CH-10000454,10979100.0,0.01
3,CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV/13045,OFF-LA-10000240,219300.0,0.01
4,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO/20335,FUR-TA-10000577,14363662.5,0.02
5,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO/20335,OFF-ST-10000760,335520.0,0.03


In [0]:
df_store.count()

9800

Checking for null values in each dataframe

In [0]:
from pyspark.sql.functions import count, when, isnull, col

dfs = [df_customer, df_product, df_store]
df_names = ["df_customer", "df_product", "df_store"]

# Count the null values per column; the loop variable `c` avoids shadowing the imported `col`.
for name, df in zip(df_names, dfs):
    print(f"Null values in {name}:")
    df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).show()


Null values in df_customer:
+-----------+-------------+-------+-------+----+-----+-----------+------+---+
|Customer_ID|Customer_Name|Segment|Country|City|State|Postal_Code|Region|Age|
+-----------+-------------+-------+-------+----+-----+-----------+------+---+
|          0|            0|      0|      0|   0|    0|          0|     0|  0|
+-----------+-------------+-------+-------+----+-----+-----------+------+---+

Null values in df_product:
+----------+--------+------------+------------+
|Product_ID|Category|Sub_Category|Product_Name|
+----------+--------+------------+------------+
|         0|       0|           0|           0|
+----------+--------+------------+------------+

Null values in df_store:
+------+--------+----------+---------+---------+-----------+----------+-----+--------+
|Row_ID|Order_ID|Order_Date|Ship_Date|Ship_Mode|Customer_ID|Product_ID|Sales|Discount|
+------+--------+----------+---------+---------+-----------+----------+-----+--------+
|     0|       0|         0
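
The null check above relies on two behaviours: `when` without an `otherwise` branch yields null for rows that fail the condition, and `count` skips nulls, so only the rows where the column is null are counted. The same count can be written as a SQL expression that spells the logic out. This is a minimal sketch, with `null_count_exprs` and `show_null_counts` as hypothetical helper names.

```python
def null_count_exprs(columns):
    """Build one SQL expression per column that counts its NULL values."""
    return [
        f"SUM(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}`"
        for c in columns
    ]


def show_null_counts(df):
    """Print a one-row table of null counts, one column per input column."""
    df.selectExpr(*null_count_exprs(df.columns)).show()


print(null_count_exprs(["Sales"])[0])
```

One difference from the `count`-based version: on an empty dataframe `SUM` returns null instead of 0.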

The store dataframe has 30 records with a null value in the Sales column; listing them

In [0]:
df_store.where(col("Sales").isNull()).show()

+------+--------------+----------+----------+--------------+-----------+---------------+-----+--------+
|Row_ID|      Order_ID|Order_Date| Ship_Date|     Ship_Mode|Customer_ID|     Product_ID|Sales|Discount|
+------+--------------+----------+----------+--------------+-----------+---------------+-----+--------+
|    76|US-2018-118038| 12/9/2018|12/11/2018|   First Class|   KB/16600|OFF-BI-10004182| NULL|    0.01|
|   977|US-2018-100209|  7/9/2018| 7/15/2018|Standard Class|   TD/20995|OFF-BI-10002012| NULL|    0.01|
|   988|CA-2016-146829| 3/10/2016| 3/10/2016|      Same Day|   TS/21340|OFF-BI-10004022| NULL|    0.01|
|  1113|US-2017-110156|11/19/2017|11/24/2017|Standard Class|   EH/13945|OFF-BI-10002609| NULL|    0.03|
|  1333|CA-2015-122567| 2/16/2015| 2/21/2015|Standard Class|   MN/17935|OFF-BI-10002012| NULL|    0.03|
|  1686|CA-2018-149489| 4/24/2018| 4/27/2018|   First Class|   DK/12835|OFF-BI-10002813| NULL|    0.02|
|  2107|US-2015-152723| 9/26/2015| 9/26/2015|      Same Day|   H

Counting the records with a null Sales value in the store dataframe

In [0]:
df_store.where(col("Sales").isNull()).count()

30

Dropping records with null values

In [0]:
df_store = df_store.dropna(subset=["Sales"])
df_store.show()

+------+--------------+----------+----------+--------------+-----------+---------------+----------+--------+
|Row_ID|      Order_ID|Order_Date| Ship_Date|     Ship_Mode|Customer_ID|     Product_ID|     Sales|Discount|
+------+--------------+----------+----------+--------------+-----------+---------------+----------+--------+
|     1|CA-2017-152156| 11/8/2017|11/11/2017|  Second Class|   CG/12520|FUR-BO-10001798|   3929400|    0.02|
|     2|CA-2017-152156| 11/8/2017|11/11/2017|  Second Class|   CG/12520|FUR-CH-10000454|  10979100|    0.01|
|     3|CA-2017-138688| 6/12/2017| 6/16/2017|  Second Class|   DV/13045|OFF-LA-10000240|    219300|    0.01|
|     4|US-2016-108966|10/11/2016|10/18/2016|Standard Class|   SO/20335|FUR-TA-10000577|14363662.5|    0.02|
|     5|US-2016-108966|10/11/2016|10/18/2016|Standard Class|   SO/20335|OFF-ST-10000760|    335520|    0.03|
|     6|CA-2015-115812|  6/9/2015| 6/14/2015|Standard Class|   BH/11710|FUR-FU-10001487|    732900|    0.02|
|     7|CA-2015-115

Counting null Sales values in the store dataframe again after dropping those records

In [0]:
df_store.where(col("Sales").isNull()).count()

0
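
The counts reported above allow a quick consistency check: the store dataframe started with 9800 rows and 30 of them had a null Sales value, so 9770 rows should remain after the drop. A small sketch of that check follows; `rows_expected_after_drop` is a hypothetical helper, not part of the original notebook.

```python
def rows_expected_after_drop(total_rows, null_rows):
    """Rows that should remain once the rows with nulls are removed."""
    if not 0 <= null_rows <= total_rows:
        raise ValueError("null_rows must be between 0 and total_rows")
    return total_rows - null_rows


# Counts taken from the notebook outputs above.
expected = rows_expected_after_drop(total_rows=9800, null_rows=30)
print(expected)  # 9770

# In the notebook, the check itself would be:
# assert df_store.count() == expected
```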

Checking for duplicate records in each dataframe

In [0]:
dfs = [df_customer, df_product, df_store]
df_names = ["df_customer", "df_product", "df_store"]

for name, df in zip(df_names, dfs):
    original_count = df.count()
    no_duplicates_count = df.dropDuplicates().count()

    if no_duplicates_count < original_count:
        print(f"Duplicates found in {name}.")
    else:
        print(f"No duplicates found in {name}.")

No duplicates found in df_customer.
No duplicates found in df_product.
No duplicates found in df_store.


Checking data type of each column in customer

In [0]:
df_customer.printSchema()

root
 |-- Customer_ID: string (nullable = true)
 |-- Customer_Name: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Postal_Code: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Age: string (nullable = true)



Changing the Age column datatype from string to integer

In [0]:
df_customer = df_customer.withColumn("Age", col("Age").cast("integer"))
df_customer.printSchema()

root
 |-- Customer_ID: string (nullable = true)
 |-- Customer_Name: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Postal_Code: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Age: integer (nullable = true)



Checking data type of each column in Store

In [0]:
df_store.printSchema()

root
 |-- Row_ID: string (nullable = true)
 |-- Order_ID: string (nullable = true)
 |-- Order_Date: string (nullable = true)
 |-- Ship_Date: string (nullable = true)
 |-- Ship_Mode: string (nullable = true)
 |-- Customer_ID: string (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Sales: string (nullable = true)
 |-- Discount: string (nullable = true)
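
Every column here is still a string, so the numeric columns need a cast. When ANSI mode is disabled, casting a non-numeric string yields null rather than an error, so it can be worth counting such values before casting. A hedged sketch follows; `failed_cast_filter` and `count_failed_casts` are hypothetical helper names.

```python
def failed_cast_filter(column, target_type):
    """SQL predicate matching non-null values that would become NULL when cast."""
    return f"`{column}` IS NOT NULL AND CAST(`{column}` AS {target_type}) IS NULL"


def count_failed_casts(df, column, target_type):
    """Count the rows whose value in `column` cannot be cast to `target_type`.

    Assumes ANSI mode is disabled; with ANSI mode enabled the cast raises instead.
    """
    return df.filter(failed_cast_filter(column, target_type)).count()


print(failed_cast_filter("Sales", "DOUBLE"))
```

A result of 0 from `count_failed_casts(df_store, "Sales", "DOUBLE")` would indicate that the cast introduces no new nulls.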



Changing the Sales and Discount column datatypes from string to double

In [0]:
from pyspark.sql.functions import col

for col_name in ['Sales', 'Discount']:
    df_store = df_store.withColumn(col_name, col(col_name).cast("double"))

df_store.printSchema()

root
 |-- Row_ID: string (nullable = true)
 |-- Order_ID: string (nullable = true)
 |-- Order_Date: string (nullable = true)
 |-- Ship_Date: string (nullable = true)
 |-- Ship_Mode: string (nullable = true)
 |-- Customer_ID: string (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Sales: double (nullable = true)
 |-- Discount: double (nullable = true)



Removing Row_ID column from Store table

In [0]:
df_store = df_store.drop('Row_ID')
display(df_store.head(5))

Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Product_ID,Sales,Discount
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG/12520,FUR-BO-10001798,3929400.0,0.02
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG/12520,FUR-CH-10000454,10979100.0,0.01
CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV/13045,OFF-LA-10000240,219300.0,0.01
US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO/20335,FUR-TA-10000577,14363662.5,0.02
US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO/20335,OFF-ST-10000760,335520.0,0.03


Changing datatype of Order_Date and Ship_Date columns from string to Date

In [0]:
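# Legacy parsing lets the zero-padded pattern MM/dd/yyyy also match unpadded values such as 6/9/2015.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")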

from pyspark.sql.functions import to_date

df_store = df_store.withColumn("Order_Date", to_date(df_store["Order_Date"], "MM/dd/yyyy"))
df_store = df_store.withColumn("Ship_Date", to_date(df_store["Ship_Date"], "MM/dd/yyyy"))
display(df_store)

Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Product_ID,Sales,Discount
CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG/12520,FUR-BO-10001798,3929400.0,0.02
CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG/12520,FUR-CH-10000454,10979100.0,0.01
CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV/13045,OFF-LA-10000240,219300.0,0.01
US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO/20335,FUR-TA-10000577,14363662.5,0.02
US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO/20335,OFF-ST-10000760,335520.0,0.03
CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH/11710,FUR-FU-10001487,732900.0,0.02
CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH/11710,OFF-AR-10002833,109200.0,0.02
CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH/11710,TEC-PH-10002275,13607280.0,0.02
CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH/11710,OFF-BI-10003910,277560.0,0.02
CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH/11710,OFF-AP-10002892,1723500.0,0.01
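
The legacy parser policy set above applies to the whole Spark session. An alternative that may make it unnecessary, assuming Spark 3.0 or later, is the pattern `M/d/yyyy`, whose single-letter fields accept both one- and two-digit months and days. The sketch below uses hypothetical helper names (`to_date_expr`, `parse_date_columns`) and would be applied to the string columns in place of the conversion above.

```python
def to_date_expr(column, pattern="M/d/yyyy"):
    """SQL expression that parses a string column into a date, keeping its name."""
    return f"to_date(`{column}`, '{pattern}') AS `{column}`"


def parse_date_columns(df, date_columns, pattern="M/d/yyyy"):
    """Return df with the given string columns parsed as dates, column order kept."""
    exprs = [
        to_date_expr(c, pattern) if c in date_columns else f"`{c}`"
        for c in df.columns
    ]
    return df.selectExpr(*exprs)


print(to_date_expr("Order_Date"))
```

Usage would look like `df_store = parse_date_columns(df_store, ["Order_Date", "Ship_Date"])`.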


Saving cleaned data in the silver container

In [0]:
dfs = [df_customer, df_product, df_store]
df_names = ["df_customer", "df_product", "df_store"]

for name, df in zip(df_names, dfs):
    output_path = f'/mnt/silver/{name}/'
    df.write.format('delta').mode("overwrite").save(output_path)
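
After the write, reading each table back can confirm that it holds the expected number of rows. This is a sketch under stated assumptions: `silver_path` and `verify_row_count` are hypothetical helpers, and the paths follow the `/mnt/silver/<name>/` layout used above.

```python
def silver_path(name, root="/mnt/silver"):
    """Build the output folder for a table, e.g. /mnt/silver/df_store/."""
    return f"{root.rstrip('/')}/{name}/"


def verify_row_count(spark, df, path):
    """Return True if the Delta table at `path` has as many rows as `df`."""
    written = spark.read.format("delta").load(path)
    return written.count() == df.count()


print(silver_path("df_store"))  # /mnt/silver/df_store/
```

In the notebook, `verify_row_count(spark, df_store, silver_path("df_store"))` would compare the saved table against the in-memory dataframe.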