# 🥈 **Silver Layer: Data Cleaning & Standardization**

## **1. Overview**
The **Silver Layer** represents the "Cleaned" and "Validated" state of our data. In this stage, we transform the raw Bronze data into a queryable format by applying a strict sequence of cleaning and casting operations.



## **Key Ingestion Summary**
| Step | Operation | Goal |
| :--- | :--- | :--- |
| 1 | **Rename** | Schema Compatibility |
| 2 | **Trim** | Data Consistency |
| 3 | **Date Cast** | Temporal Accuracy |
| 4 | **Upper Case** | Category Merging |
| 5 | **Numeric Cast** | Computational Readiness |

---

> **Note:** This multi-step process ensures that the Silver Layer provides a "Single Source of Truth" for the Sales dataset, ready for the Gold Layer aggregation.
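
The five steps in the summary table can be sketched as one function applied in the same order. This is a minimal illustration rather than the notebook's actual implementation: the three-column `bronze` frame and the helper name `to_silver` are hypothetical, and the date format `%m/%d/%Y` is assumed from the sample rows.

```python
import pandas as pd

# Hypothetical three-column slice of the Bronze data, for illustration only.
bronze = pd.DataFrame({
    "OrderDate": [" 5/12/2020", "5/13/2020"],
    "CurrencyCode": ["usd ", "USD"],
    "Unit_Price": ["1,199.30", "783.90"],
})

def to_silver(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Rename: rename() returns a new DataFrame, so the input is not modified.
    out = df.rename(columns={"OrderDate": "Order_Date", "CurrencyCode": "Currency_Code"})
    # 2. Trim: strip leading and trailing whitespace from the text columns.
    for col in ["Order_Date", "Currency_Code", "Unit_Price"]:
        out[col] = out[col].str.strip()
    # 3. Date cast: assumed month/day/year layout.
    out["Order_Date"] = pd.to_datetime(out["Order_Date"], format="%m/%d/%Y")
    # 4. Upper case: merge case variants of the same category.
    out["Currency_Code"] = out["Currency_Code"].str.upper()
    # 5. Numeric cast: drop thousands separators, then convert to float.
    out["Unit_Price"] = out["Unit_Price"].str.replace(",", "", regex=False).astype(float)
    return out

silver = to_silver(bronze)
print(silver.dtypes)
```

Trimming runs before the casts so that the date and numeric parsers only ever see clean strings.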

In [0]:
# Initialization
import pandas as pd
import pyspark.sql.functions as f

# Load the Bronze table and convert it to a pandas DataFrame
sales = spark.read.table("sales.bronz_layer.sales")
sales = sales.toPandas()
print(type(sales))
sales.head(5)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,OrderNumber,Sales_Channel,WarehouseCode,ProcuredDate,OrderDate,ShipDate,DeliveryDate,CurrencyCode,_SalesTeamID,_CustomerID,_StoreID,_ProductID,Order_Quantity,Discount_Applied,Unit_Price,Unit_Cost,ingestion_data
0,SO - 0006094,In-Store,WARE-NMK1003,12/1/2019,5/12/2020,6/4/2020,6/5/2020,USD,11.0,2.0,148.0,25.0,4.0,0.05,1199.3,791.54,2026-02-02 12:53:31.044491
1,SO - 0006095,In-Store,WARE-UHY1004,3/10/2020,5/13/2020,5/21/2020,5/23/2020,USD,9.0,20.0,215.0,13.0,3.0,0.15,783.9,642.8,2026-02-02 12:53:31.044491
2,SO - 0006096,In-Store,WARE-MKL1006,3/10/2020,5/13/2020,5/24/2020,5/29/2020,USD,2.0,48.0,349.0,32.0,2.0,0.1,268.0,198.32,2026-02-02 12:53:31.044491
3,SO - 0006097,Online,WARE-PUJ1005,12/1/2019,5/13/2020,5/19/2020,5/20/2020,USD,13.0,27.0,292.0,28.0,6.0,0.075,2010.0,1386.9,2026-02-02 12:53:31.044491
4,SO - 0006098,Wholesale,WARE-NMK1003,12/1/2019,5/13/2020,5/25/2020,5/26/2020,USD,26.0,24.0,140.0,34.0,1.0,0.15,3973.1,3377.14,2026-02-02 12:53:31.044491


## **2. Transformation Pipeline Sequence**
We apply the following transformations, in the order listed, so that the data is consistent and high quality:


### 🔄 **Phase 1: Structural Cleanup (Renaming)**
* **Action**: Standardized all column names.
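* **Logic**: Separated words with underscores (`_`) and dropped the leading underscore from ID columns (for example, `OrderNumber` → `Order_Number` and `_CustomerID` → `Customer_ID`). Consistent names without spaces or special characters also satisfy **Delta Lake** schema requirements.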
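
The naming convention can also be derived programmatically instead of being listed by hand. The sketch below is one possible approach, not what the notebook does: `to_silver_name` is a hypothetical helper, and `"Order Date"` is an invented column included only to show the space handling.

```python
import re

def to_silver_name(name: str) -> str:
    """Drop leading underscores, replace spaces, and split CamelCase words."""
    cleaned = name.strip().lstrip("_").replace(" ", "_")
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", cleaned)

bronze_columns = ["OrderNumber", "_SalesTeamID", "Sales_Channel", "Order Date", "ingestion_data"]
rename_map = {col: to_silver_name(col) for col in bronze_columns}
print(rename_map)
```

A rule like this only reshapes names. A semantic rename such as `ProcuredDate` → `Purchased_Date` still needs an explicit mapping, because the rule alone would produce `Procured_Date`.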


In [0]:
# Map Bronze column names to the Silver naming convention. rename() silently
# ignores keys that match no column (e.g. 'SalesChannel', which Bronze already
# stores as 'Sales_Channel'), so those entries are harmless no-ops.
sales = sales.rename(columns={
    'OrderNumber': 'Order_Number', 'SalesChannel': 'Sales_Channel', 'WarehouseCode': 'Warehouse_Code', 'ProcuredDate': 'Purchased_Date',
    'OrderDate': 'Order_Date', 'ShipDate': 'Ship_Date', 'DeliveryDate': 'Delivery_Date', 'CurrencyCode': 'Currency_Code',
    '_SalesTeamID': 'Sales_Team_ID', '_CustomerID': 'Customer_ID', '_StoreID': 'Store_ID', '_ProductID': 'Product_ID',
    'OrderQuantity': 'Order_Quantity', 'DiscountApplied': 'Discount_Applied', 'UnitPrice': 'Unit_Price', 'UnitCost': 'Unit_Cost'
})
print(sales.columns)

Index(['Order_Number', 'Sales_Channel', 'Warehouse_Code', 'Purchased_Date',
       'Order_Date', 'Ship_Date', 'Delivery_Date', 'Currency_Code',
       'Sales_Team_ID', 'Customer_ID', 'Store_ID', 'Product_ID',
       'Order_Quantity', 'Discount_Applied', 'Unit_Price', 'Unit_Cost',
       'ingestion_data'],
      dtype='object')


### ✂️ **Phase 2: String Refining (Trimming)**
* **Action**: Global whitespace removal.
* **Logic**: Applied a `strip()` function across all text-based columns. This eliminates hidden leading or trailing spaces that often cause "Join Misses" between tables.
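
To make the "Join Miss" concrete, the sketch below merges two tiny frames before and after trimming. The `warehouses` lookup table and its `Region` values are invented for illustration and are not part of the Sales dataset.

```python
import pandas as pd

# One order key carries a trailing space, as can happen in raw exports.
orders = pd.DataFrame({
    "Warehouse_Code": ["WARE-NMK1003 ", "WARE-UHY1004"],
    "Order_Quantity": [4, 3],
})
# Hypothetical lookup table with clean keys.
warehouses = pd.DataFrame({
    "Warehouse_Code": ["WARE-NMK1003", "WARE-UHY1004"],
    "Region": ["North", "South"],
})

# Before trimming: the padded key finds no partner, so Region is missing.
before = orders.merge(warehouses, on="Warehouse_Code", how="left")
missed_before = int(before["Region"].isna().sum())

# After trimming: every order matches its warehouse.
orders["Warehouse_Code"] = orders["Warehouse_Code"].str.strip()
after = orders.merge(warehouses, on="Warehouse_Code", how="left")
missed_after = int(after["Region"].isna().sum())

print(missed_before, missed_after)
```

The untrimmed merge raises no error. The unmatched row simply comes back with a missing `Region`, which is why such misses are easy to overlook.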

In [0]:
# Select the columns with a string (object) dtype first
string_columns = sales.select_dtypes(include=['object']).columns
print(f"String columns to trim: {string_columns}")

# Strip leading and trailing whitespace from those columns
sales[string_columns] = sales[string_columns].apply(lambda x: x.str.strip())

# Check which columns still contain leading or trailing whitespace
still_having_columns = [
    col for col in string_columns
    if sales[col].dropna().apply(lambda x: str(x).strip() != str(x)).any()
]
print(f"Columns that still have spaces: {still_having_columns}")

String columns to trim: Index(['Order_Number', 'Sales_Channel', 'Warehouse_Code', 'Purchased_Date',
       'Order_Date', 'Ship_Date', 'Delivery_Date', 'Currency_Code',
       'Unit_Price', 'Unit_Cost'],
      dtype='object')
Columns that still have spaces: []


In [0]:
summary = pd.DataFrame(
    {
        'Dtype': sales.dtypes,
        'sample value': sales.iloc[0],
        'Unique value': sales.nunique()
    })
print(summary)

                           Dtype                sample value  Unique value
Order_Number              object                SO - 0006094          7991
Sales_Channel             object                    In-Store             4
Warehouse_Code            object                WARE-NMK1003             6
Purchased_Date            object                   12/1/2019            11
Order_Date                object                   5/12/2020           945
Ship_Date                 object                    6/4/2020           966
Delivery_Date             object                    6/5/2020           966
Currency_Code             object                         USD             6
Sales_Team_ID            float64                        11.0            28
Customer_ID              float64                         2.0            50
Store_ID                 float64                       148.0           367
Product_ID               float64                        25.0            47
Order_Quantity           

### 📅 **Phase 3: Temporal Standardization (Dates)**
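* **Action**: Parsed date strings into pandas `datetime64[ns]` values.
* **Logic**: Transformed `Purchased_Date`, `Order_Date`, `Ship_Date`, and `Delivery_Date`. The source strings (for example `12/1/2019`) carry no time component, so every value is stored at midnight, which keeps calendar-based reporting simple. In the Silver table these columns therefore appear as timestamps at `00:00:00`.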
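
If stricter parsing is wanted, one option is to state the expected layout explicitly and collect the rows that do not fit. This is a sketch under assumptions: the format `%m/%d/%Y` is inferred from the sample rows, and the `"not a date"` value is invented to show the failure path.

```python
import pandas as pd

raw = pd.Series(["12/1/2019", "5/12/2020", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Keep the offending source strings so they can be inspected or quarantined.
bad_rows = raw[parsed.isna()]

# Optional: plain date objects, for consumers that should not see a time part.
dates_only = parsed.dt.date

print(parsed)
print(bad_rows.tolist())
```

An explicit format also removes any doubt about whether `5/12/2020` means 12 May or 5 December.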


In [0]:
# Parse the date columns (strings such as 12/1/2019) into datetime64[ns]
date_columns = ['Purchased_Date', 'Ship_Date', 'Order_Date', 'Delivery_Date']
for col in date_columns:
    sales[col] = pd.to_datetime(sales[col])

print(sales[date_columns].head())
sales.info()

  Purchased_Date  Ship_Date Order_Date Delivery_Date
0     2019-12-01 2020-06-04 2020-05-12    2020-06-05
1     2020-03-10 2020-05-21 2020-05-13    2020-05-23
2     2020-03-10 2020-05-24 2020-05-13    2020-05-29
3     2019-12-01 2020-05-19 2020-05-13    2020-05-20
4     2019-12-01 2020-05-25 2020-05-13    2020-05-26
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7991 entries, 0 to 7990
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order_Number      7991 non-null   object        
 1   Sales_Channel     7991 non-null   object        
 2   Warehouse_Code    7991 non-null   object        
 3   Purchased_Date    7991 non-null   datetime64[ns]
 4   Order_Date        7991 non-null   datetime64[ns]
 5   Ship_Date         7991 non-null   datetime64[ns]
 6   Delivery_Date     7991 non-null   datetime64[ns]
 7   Currency_Code     7991 non-null   object        
 8   Sales_Team_ID     7991 non-nul

### 🔠 **Phase 4: Categorical Normalization (Upper Case)**
* **Action**: Case standardization for `Currency_Code`.
* **Logic**: Forced all codes to **UPPERCASE** (e.g., `usd` → `USD`). This ensures that grouping and filtering by currency remain accurate.
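
Upper-casing merges case variants, but it does not catch malformed codes. A possible follow-up check is sketched below; the three-letter pattern is an assumption about what a valid code looks like, and the sample value `"us"` is invented.

```python
import pandas as pd

codes = pd.Series(["USD", "usd", "Usd", "us"])

# Step 1: normalize case so that variants collapse into one category.
normalized = codes.str.upper()

# Step 2: flag anything that is not exactly three uppercase letters.
is_valid = normalized.str.fullmatch(r"[A-Z]{3}", na=False)
invalid = normalized[~is_valid]

print(normalized.value_counts())
print(invalid.tolist())
```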

In [0]:
print("Unique Currency Codes ")
print(sales['Currency_Code'].value_counts())

# Convert every currency code to upper case
sales['Currency_Code'] = sales['Currency_Code'].str.upper()

print("Unique Currency codes after cleaning")
print(sales['Currency_Code'].value_counts())

Unique Currency Codes 
Currency_Code
USD    7974
usd       7
Usd       5
usD       2
uSD       2
UsD       1
Name: count, dtype: int64
Unique Currency codes after cleaning
Currency_Code
USD    7991
Name: count, dtype: int64


### 🔢 **Phase 5: Type Casting (Numeric Optimization)**
* **IDs & Quantities → `Integer`**: Converted `Sales_Team_ID`, `Customer_ID`, `Store_ID`, `Product_ID`, and `Order_Quantity` to integers. This removes the meaningless decimal part (`11.0` → `11`) so the values act as exact keys and counts.
* **Financials → `Float`**: Removed thousands separators (commas) from `Unit_Price` and `Unit_Cost` and converted them to floats, enabling calculations such as Profit and Margin.
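
The casts can be guarded so that bad input fails loudly, since casting a float with a fractional part to an integer truncates it without warning. The sketch below uses a hypothetical two-row frame, and the per-unit `Unit_Profit` and `Unit_Margin` definitions are assumptions that ignore quantity and discount.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [2.0, 20.0],
    "Order_Quantity": [4.0, 3.0],
    "Unit_Price": ["1,199.30", "783.90"],
    "Unit_Cost": ["791.54", "642.80"],
})

int_columns = ["Customer_ID", "Order_Quantity"]
price_columns = ["Unit_Price", "Unit_Cost"]

# Guard: refuse to cast if any value is missing or has a fractional part.
has_nulls = df[int_columns].isna().any().any()
has_fractions = (df[int_columns] % 1 != 0).any().any()
if has_nulls or has_fractions:
    raise ValueError("ID or quantity columns contain nulls or fractional values")

df = df.astype({col: "int64" for col in int_columns})

# Remove thousands separators, then convert the price strings to floats.
for col in price_columns:
    df[col] = df[col].str.replace(",", "", regex=False).astype(float)

# Assumed per-unit definitions, for illustration only.
df["Unit_Profit"] = df["Unit_Price"] - df["Unit_Cost"]
df["Unit_Margin"] = df["Unit_Profit"] / df["Unit_Price"]

print(df.dtypes)
```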

In [0]:
# Cast ID and quantity columns to integers
id_columns = ['Sales_Team_ID', 'Customer_ID', 'Product_ID', 'Store_ID', 'Order_Quantity']
sales[id_columns] = sales[id_columns].astype(int)

# Remove thousands separators from the price columns and cast them to float
price_columns = ['Unit_Price', 'Unit_Cost']
sales[price_columns] = sales[price_columns].replace(',', '', regex=True).astype(float)

# Confirm the resulting dtypes
sales.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7991 entries, 0 to 7990
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order_Number      7991 non-null   object        
 1   Sales_Channel     7991 non-null   object        
 2   Warehouse_Code    7991 non-null   object        
 3   Purchased_Date    7991 non-null   datetime64[ns]
 4   Order_Date        7991 non-null   datetime64[ns]
 5   Ship_Date         7991 non-null   datetime64[ns]
 6   Delivery_Date     7991 non-null   datetime64[ns]
 7   Currency_Code     7991 non-null   object        
 8   Sales_Team_ID     7991 non-null   int64         
 9   Customer_ID       7991 non-null   int64         
 10  Store_ID          7991 non-null   int64         
 11  Product_ID        7991 non-null   int64         
 12  Order_Quantity    7991 non-null   int64         
 13  Discount_Applied  7991 non-null   float64       
 14  Unit_Price        7991 n

## **Sanity check for dataframe**

In [0]:
sales.head(5)

Unnamed: 0,Order_Number,Sales_Channel,Warehouse_Code,Purchased_Date,Order_Date,Ship_Date,Delivery_Date,Currency_Code,Sales_Team_ID,Customer_ID,Store_ID,Product_ID,Order_Quantity,Discount_Applied,Unit_Price,Unit_Cost,ingestion_data
0,SO - 0006094,In-Store,WARE-NMK1003,2019-12-01,2020-05-12,2020-06-04,2020-06-05,USD,11,2,148,25,4,0.05,1199.3,791.54,2026-02-02 12:53:31.044491
1,SO - 0006095,In-Store,WARE-UHY1004,2020-03-10,2020-05-13,2020-05-21,2020-05-23,USD,9,20,215,13,3,0.15,783.9,642.8,2026-02-02 12:53:31.044491
2,SO - 0006096,In-Store,WARE-MKL1006,2020-03-10,2020-05-13,2020-05-24,2020-05-29,USD,2,48,349,32,2,0.1,268.0,198.32,2026-02-02 12:53:31.044491
3,SO - 0006097,Online,WARE-PUJ1005,2019-12-01,2020-05-13,2020-05-19,2020-05-20,USD,13,27,292,28,6,0.075,2010.0,1386.9,2026-02-02 12:53:31.044491
4,SO - 0006098,Wholesale,WARE-NMK1003,2019-12-01,2020-05-13,2020-05-25,2020-05-26,USD,26,24,140,34,1,0.15,3973.1,3377.14,2026-02-02 12:53:31.044491


## **Writing the Silver Table**

In [0]:
# Convert Pandas to Spark
spark_df = spark.createDataFrame(sales)


spark_df.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .format("delta") \
    .saveAsTable("sales.silver_layer.sales")

## **Sanity check of silver table**

In [0]:
%sql
SELECT * FROM sales.silver_layer.sales LIMIT 10;

Order_Number,Sales_Channel,Warehouse_Code,Purchased_Date,Order_Date,Ship_Date,Delivery_Date,Currency_Code,Sales_Team_ID,Customer_ID,Store_ID,Product_ID,Order_Quantity,Discount_Applied,Unit_Price,Unit_Cost,ingestion_data
SO - 0006093,Distributor,WARE-PUJ1005,2020-03-10T00:00:00.000Z,2020-05-12T00:00:00.000Z,2020-05-25T00:00:00.000Z,2020-06-01T00:00:00.000Z,USD,24,7,312,21,3,0.05,6177.4,5065.47,2026-02-02T12:53:31.044Z
SO - 000101,In-Store,WARE-UHY1004,2017-12-31T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-14T00:00:00.000Z,2018-06-19T00:00:00.000Z,USD,6,15,259,12,5,0.075,1963.1,1001.18,2026-02-02T12:53:31.044Z
SO - 000102,Online,WARE-NMK1003,2017-12-31T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-22T00:00:00.000Z,2018-07-02T00:00:00.000Z,USD,14,20,196,27,3,0.075,3939.6,3348.66,2026-02-02T12:53:31.044Z
SO - 000103,Distributor,WARE-UHY1004,2017-12-31T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-21T00:00:00.000Z,2018-07-01T00:00:00.000Z,USD,21,16,213,16,1,0.05,1775.5,781.22,2026-02-02T12:53:31.044Z
SO - 000104,Wholesale,WARE-NMK1003,2017-12-31T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-02T00:00:00.000Z,2018-06-07T00:00:00.000Z,USD,28,48,107,23,8,0.075,2324.9,1464.69,2026-02-02T12:53:31.044Z
SO - 000105,Distributor,WARE-NMK1003,2018-04-10T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-16T00:00:00.000Z,2018-06-26T00:00:00.000Z,USD,22,49,111,26,8,0.1,1822.4,1476.14,2026-02-02T12:53:31.044Z
SO - 000106,Online,WARE-PUJ1005,2017-12-31T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-08T00:00:00.000Z,2018-06-13T00:00:00.000Z,USD,12,21,285,1,5,0.05,1038.5,446.56,2026-02-02T12:53:31.044Z
SO - 000107,In-Store,WARE-XYS1001,2017-12-31T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-08T00:00:00.000Z,2018-06-14T00:00:00.000Z,USD,10,14,6,5,4,0.15,1192.6,536.67,2026-02-02T12:53:31.044Z
SO - 000108,In-Store,WARE-PUJ1005,2018-04-10T00:00:00.000Z,2018-05-31T00:00:00.000Z,2018-06-26T00:00:00.000Z,2018-07-01T00:00:00.000Z,USD,6,9,280,46,5,0.05,1815.7,1525.19,2026-02-02T12:53:31.044Z
SO - 000109,In-Store,WARE-PUJ1005,2017-12-31T00:00:00.000Z,2018-06-01T00:00:00.000Z,2018-06-16T00:00:00.000Z,2018-06-21T00:00:00.000Z,USD,4,9,299,47,4,0.3,3879.3,2211.2,2026-02-02T12:53:31.044Z
