# 🥈 **Silver Layer: Data Cleaning & Standardization**


The **Silver Layer** represents the "Cleaned" and "Validated" state of our data. In this stage, we transform the raw Bronze data into a queryable format by applying a strict sequence of cleaning and casting operations.

# 🚀 ETL Transformation: Products Silver Layer
**Objective:** Clean, transform, and standardize raw product data for the Silver Layer using Pandas and Spark.

---
### ✅ Summary of Transformation Steps
| Step | Operation | Tool |
| :--- | :--- | :--- |
| **1** | Empty String → `NaN` | Pandas + NumPy |
| **2** | Column Split & Trim | Lambda Function |
| **3** | Rename & Reorder | List Manipulation |
| **4** | Save as Delta Table | Spark (PySpark) |

### 1. Data Profiling & Initial Overview
The raw data was initially inspected to determine the health of the dataset. Key observations included:
* Inconsistent string formats.
* Concatenated data within single columns.
* Metadata columns positioned randomly throughout the schema.
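
The observations above can be reproduced with a small profiling helper. The following is a minimal sketch, not the check used in this notebook: the `profile` function and the `sample` frame are hypothetical, and the sample only imitates the Bronze column names shown in this notebook.

```python
import pandas as pd

# Hypothetical sample that imitates the raw Bronze columns (not real data).
sample = pd.DataFrame({
    "_ProductID": ["1", "2", " "],
    "product_and_category_": [
        "Cookware/Kitchen & Dining",
        "Photo Frames / Home Decor",
        "Table Lamps/Home Decor",
    ],
    "ingestion_data": ["2026-02-02 12:53:34.913570"] * 3,
})


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, true nulls, blank strings, slash count."""
    rows = []
    for col in df.columns:
        as_text = df[col].astype(str)
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            # Values Pandas already treats as missing (NaN / None)
            "nulls": int(df[col].isna().sum()),
            # Empty or whitespace-only strings that isna() does not flag
            "blank_strings": int(as_text.str.fullmatch(r"\s*").sum()),
            # Values that look like two fields joined by a slash
            "contains_slash": int(as_text.str.contains("/", regex=False).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


report = profile(sample)
print(report)
```

A non-zero `blank_strings` count alongside a zero `nulls` count is the pattern that motivates the regex replacement in the next step.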

---

In [0]:
# Initialize
import pandas as pd
import numpy as np
import pyspark.sql.functions as f

# Load the Bronze table and convert it to a Pandas DataFrame
products = spark.read.table("sales.bronz_layer.products")
products = products.toPandas()
print(type(products))
products.head()

<class 'pandas.core.frame.DataFrame'>


| | _ProductID | product_and_category_ | ingestion_data |
| :--- | :--- | :--- | :--- |
| 0 | 1 | Cookware/Kitchen & Dining | 2026-02-02 12:53:34.913570 |
| 1 | 2 | Photo Frames/Home Decor | 2026-02-02 12:53:34.913570 |
| 2 | 3 | Table Lamps/Home Decor | 2026-02-02 12:53:34.913570 |
| 3 | 4 | Serveware/Kitchen & Dining | 2026-02-02 12:53:34.913570 |
| 4 | 5 | Bathroom Furniture/Furniture | 2026-02-02 12:53:34.913570 |


### 2. Handling "Hidden" Missing Values
Standard null checks such as `isnull()` did not flag the empty cells, because they held empty or whitespace-only strings rather than true nulls.
* **Action:** Used the regex `^\s*$` to match empty or whitespace-only strings and replaced them with `NaN`.
* **Reasoning:** Pandas methods such as `isnull()` and `dropna()` only recognize true missing values (`NaN`/`None`), so blanks must be converted before the affected rows can be counted or dropped.
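
To illustrate why the conversion is needed, here is a minimal sketch on a made-up four-row frame (the `demo` data is an assumption for illustration only). It counts nulls before and after applying the same regex replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two real IDs, one empty string, one whitespace-only string.
demo = pd.DataFrame({"_ProductID": ["1", "", "   ", "4"]})

# Blank strings are ordinary values, so isnull() reports nothing.
before = int(demo["_ProductID"].isnull().sum())

# Replace empty / whitespace-only strings with a true NaN.
cleaned = demo.replace(r"^\s*$", np.nan, regex=True)
after = int(cleaned["_ProductID"].isnull().sum())

print(f"nulls before: {before}, nulls after: {after}")
```

Only after the replacement does `dropna()` have anything to remove.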

---

In [0]:
# This looks for cells that are empty or contain only whitespace
products = products.replace(r'^\s*$', np.nan, regex=True)
print("--- Missing Value Summary ---")
null_summary = products.isnull().sum()
print(null_summary[null_summary > 0]) # Only shows columns with issues
# Drop rows that contain any missing value, then rebuild a contiguous index
products = products.dropna()
products = products.reset_index(drop=True)

--- Missing Value Summary ---
_ProductID    2
dtype: int64


### 3. Feature Engineering: Lambda Splitting & Trimming
The column `product_and_category_` contained both the product name and its category separated by a `/`.
* **Logic:** Applied a `lambda` through `.map()` that calls `.split('/')` and `.strip()` on each value.
* **Result:**
    * Created a new `Category` column from the text after the slash.
    * Updated the original column to keep only the text before the slash.
    * Stripped leading and trailing whitespace from both values.
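
As an alternative to a per-row `lambda`, the same split can be written with Pandas' vectorized string methods. This is a sketch on hypothetical sample values, not the approach used in this notebook. One behavioral difference is worth noting: when a value has no slash, `split('/')[-1]` returns the whole string, so the lambda version would copy the product name into `Category`, whereas the version below leaves `Category` missing.

```python
import pandas as pd

# Hypothetical values, including stray spaces and one row without a slash.
demo = pd.DataFrame({
    "product_and_category_": [
        "Cookware/Kitchen & Dining",
        " Photo Frames / Home Decor ",
        "Rugs",
    ]
})

# Split on the first slash only; expand=True returns one column per part.
parts = demo["product_and_category_"].str.split("/", n=1, expand=True)

# Guarantee both part columns exist even if no row contained a slash.
parts = parts.reindex(columns=[0, 1])

demo["Product_Name"] = parts[0].str.strip()
demo["Category"] = parts[1].str.strip()  # stays missing when there is no slash

print(demo[["Product_Name", "Category"]])
```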



---
### 4. Schema Refinement & Column Arrangement
To ensure the table follows the Silver Layer's structural standards:
* **Renaming:** Replaced internal technical names with clear business headers (`_ProductID` → `Product_ID`, `product_and_category_` → `Product_Name`).
* **Metadata Positioning:** Reordered the schema so that `ingestion_data` sits in the final position.
* **Why:** This keeps the primary business attributes (ID, Name, Category) at the front of the table for better user accessibility.
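
When a table has many columns, listing every name by hand is easy to get wrong. Below is a minimal sketch of a reusable helper that moves metadata columns to the end. The `move_to_end` function and the `demo` frame are hypothetical and not part of this notebook.

```python
import pandas as pd


def move_to_end(df: pd.DataFrame, trailing: list) -> pd.DataFrame:
    """Return df with the `trailing` columns moved to the end, order preserved."""
    missing = [c for c in trailing if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    leading = [c for c in df.columns if c not in trailing]
    return df[leading + list(trailing)]


# Hypothetical frame where the metadata column comes first.
demo = pd.DataFrame({
    "ingestion_data": ["2026-02-02 12:53:34.913570"],
    "Product_ID": [1],
    "Product_Name": ["Cookware"],
    "Category": ["Kitchen & Dining"],
})

reordered = move_to_end(demo, ["ingestion_data"])
print(list(reordered.columns))
```

The helper keeps the business columns in their existing order, so it only needs to know which columns count as metadata.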



---

In [0]:
# 1. Split Text
# Create the new column first, while the original still holds the full "name/category" text
products['Category'] = products['product_and_category_'].map(lambda x: str(x).split('/')[-1].strip())

# Then we update the original column
products['product_and_category_'] = products['product_and_category_'].map(lambda x: str(x).split('/')[0].strip())

# 2. Rename Columns
# Map the raw technical names to business-friendly headers
products = products.rename(columns = {
    'product_and_category_' : 'Product_Name',
    '_ProductID' : 'Product_ID'
})

# 3. Ideal Order
# Business attributes first, ingestion metadata last
ideal_order = ['Product_ID', 'Product_Name', 'Category', 'ingestion_data']

# Selecting by list reorders the columns; a missing name raises KeyError
products = products[ideal_order]

products.head(5)

| | Product_ID | Product_Name | Category | ingestion_data |
| :--- | :--- | :--- | :--- | :--- |
| 0 | 1 | Cookware | Kitchen & Dining | 2026-02-02 12:53:34.913570 |
| 1 | 2 | Photo Frames | Home Decor | 2026-02-02 12:53:34.913570 |
| 2 | 3 | Table Lamps | Home Decor | 2026-02-02 12:53:34.913570 |
| 3 | 4 | Serveware | Kitchen & Dining | 2026-02-02 12:53:34.913570 |
| 4 | 5 | Bathroom Furniture | Furniture | 2026-02-02 12:53:34.913570 |


### 5. Transition to Spark & Delta Lake
To finalize the process and ensure scalability within the Databricks environment:
* **Conversion:** The processed Pandas DataFrame was converted into a **Distributed Spark DataFrame**.
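* **Storage:** Saved the result as a **Delta Table** using `.mode("overwrite")` to refresh the Silver Layer in the Lakehouse, with `overwriteSchema` enabled so the renamed and reordered columns replace the previous table schema.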
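
Spark infers column types from the Pandas dtypes during conversion, so it can help to settle the types beforehand. The following is a minimal sketch under two assumptions that this notebook does not confirm: that `Product_ID` should be stored as an integer, and that `ingestion_data` should be a timestamp. The `prepare_for_spark` helper and the `demo` rows are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned rows in the Silver column layout (IDs still strings).
demo = pd.DataFrame({
    "Product_ID": ["1", "2", "3"],
    "Product_Name": ["Cookware", "Photo Frames", "Table Lamps"],
    "Category": ["Kitchen & Dining", "Home Decor", "Home Decor"],
    "ingestion_data": ["2026-02-02 12:53:34.913570"] * 3,
})


def prepare_for_spark(df: pd.DataFrame) -> pd.DataFrame:
    """Cast key columns to explicit dtypes and run basic sanity checks."""
    out = df.copy()
    # Assumption: IDs are whole numbers; errors="raise" surfaces bad values early.
    out["Product_ID"] = pd.to_numeric(out["Product_ID"], errors="raise").astype("int64")
    # Assumption: the ingestion column holds parseable timestamps.
    out["ingestion_data"] = pd.to_datetime(out["ingestion_data"])
    if out["Product_ID"].duplicated().any():
        raise ValueError("Product_ID contains duplicates")
    return out


ready = prepare_for_spark(demo)
print(ready.dtypes)
```

The resulting frame could then be passed to `spark.createDataFrame(...)` in place of the untyped one.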

---

In [0]:
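# Convert the cleaned Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(products)

# Overwrite the Silver table; overwriteSchema lets the new column layout replace the old one
spark_df.write \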
    .mode("overwrite") \
    .format("delta") \
    .option("overwriteSchema", "true") \
    .saveAsTable("sales.silver_layer.products")