# 🥈 **Silver Layer: Data Cleaning & Standardization**


The **Silver Layer** represents the "Cleaned" and "Validated" state of our data. In this stage, we transform the raw Bronze data into a queryable format by applying a strict sequence of cleaning and casting operations.

# 🚀 ETL Transformation: Regions Silver Layer
**Objective:** Standardize geographical data by cleaning state codes and removing redundant regional records.

---

### ✅ Summary of Transformation Steps
| Step | Operation | Tool |
| :--- | :--- | :--- |
| **1** | String Cleaning | Pandas `.str.strip()` |
| **2** | Deduplication | Pandas `.drop_duplicates()` |
| **3** | Rename Columns | Pandas `.rename()` |
| **4** | Data Type Mapping | Spark `createDataFrame()` |
| **5** | Silver Layer Save | Delta Lake Table |

### 1. Data Acquisition & Environment Transition
The process began by loading the raw regional data from the Bronze Layer into a Spark DataFrame, which was then converted to Pandas to facilitate granular string manipulation.
* **Source Table:** `sales.bronz_layer.regions`
* **Method:** `.toPandas()`, which collects the table onto the driver for in-memory cleaning. This is practical here because the regions table is small (52 rows).



---

In [0]:
import pandas as pd 
import pyspark.sql.functions as f

regions = spark.read.table("sales.bronz_layer.regions")
regions = regions.toPandas()
print(type(regions))
regions.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,StateCode,State,Region,ingestion_data
0,AL,Alabama,South,2026-02-02 12:53:43.463978
1,AR,Arkansas,South,2026-02-02 12:53:43.463978
2,AZ,Arizona,West,2026-02-02 12:53:43.463978
3,CA,California,West,2026-02-02 12:53:43.463978
4,CO,Colorado,West,2026-02-02 12:53:43.463978


### 2. Automated Whitespace Trimming
Geographical codes (such as state codes) are sensitive to hidden whitespace, which can break later map visualizations or joins with sales data.
* **Detection:** Identified all columns with `object` (string) data types.
* **Action:** Iterated through all string columns using `.str.strip()` to remove leading and trailing spaces.
* **Result:** Ensured consistency across codes (e.g., `" NY"` and `"NY "` were standardized to `"NY"`).

---

In [0]:
# Identify the columns that contain text
string_columns = regions.select_dtypes(include=['object']).columns
# Loop through them and trim leading and trailing whitespace
for col in string_columns:
    # Note: astype(str) also converts missing values to the literal strings 'None'/'nan'
    regions[col] = regions[col].astype(str).str.strip()

print(f"Trimmed {len(string_columns)} columns : {list(string_columns)}")

Trimmed 3 columns : ['StateCode', 'State', 'Region']
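
The cell above casts every text column with `.astype(str)` before stripping, which would turn a missing value into a literal string. As a minimal sketch of an alternative, assuming missing values should stay missing, the hypothetical helper `trim_string_columns` below strips only values that are real strings. The helper name and the sample data are illustrative and not part of the original notebook.

```python
import pandas as pd


def trim_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with whitespace stripped from text columns."""
    out = df.copy()
    for col in out.columns:
        # dtype.kind == "O" covers object columns and pandas string dtypes
        if out[col].dtype.kind != "O":
            continue
        # Strip only real strings so that None/NaN values stay missing
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


# Illustrative data, not taken from the Bronze table
sample = pd.DataFrame({
    "StateCode": [" NY", "NY ", None],
    "State": ["New York", " New York ", "Unknown"],
})
cleaned = trim_string_columns(sample)
print(cleaned)
```

The trade-off is that `map` with a Python function is slower than the vectorized `.str.strip()`, which should not matter for a table of this size.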


### 3. Regional Deduplication
After cleaning the string columns, the dataset was scanned for duplicate rows that may have been introduced during the data entry phase in the source systems.
* **Process:**
    * Calculated initial row count vs. duplicate count using `.duplicated().sum()`.
    * Removed duplicates while retaining the `first` occurrence.
* **Integrity:** Executed `.reset_index(drop=True)` to ensure a continuous index for the cleaned dataset.



---

In [0]:
print(f"length of dataframe {len(regions)}")
duplicate_count = regions.duplicated().sum()  # count fully duplicated rows
print(f"found {duplicate_count} duplicated rows")
regions = regions.drop_duplicates(keep='first')
regions = regions.reset_index(drop=True)
print(f"length after removing duplicates {len(regions)}")

length of dataframe 52
found 4 duplicated rows
length after removing duplicates 48
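
To illustrate why trimming comes before deduplication, here is a small sketch on made-up data (the rows below are an assumption for demonstration, not values from the regions table). `.duplicated()` compares rows exactly, so `"NY"` and `"NY "` only count as duplicates once the whitespace is removed.

```python
import pandas as pd

# Illustrative rows: one pair differs only by trailing whitespace
raw = pd.DataFrame({
    "StateCode": ["NY", "NY ", "CA", "CA"],
    "Region": ["Northeast", "Northeast", "West", "West"],
})

# Before trimming, only the second CA row is an exact duplicate
duplicates_before = int(raw.duplicated().sum())

trimmed = raw.copy()
for col in trimmed.columns:
    trimmed[col] = trimmed[col].str.strip()

# After trimming, the NY pair matches as well
duplicates_after = int(trimmed.duplicated().sum())

# Keep the first occurrence and rebuild a continuous index
deduped = trimmed.drop_duplicates(keep="first").reset_index(drop=True)
print(duplicates_before, duplicates_after, len(deduped))
```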


### 4. Schema Refinement & Naming Conventions
To ensure the table is "Business Ready" and matches the Silver Layer naming standards:
* **Renaming:** Converted `StateCode` to `State_Code`.
* **Standardization:** Applied underscore-separated column naming to improve readability for SQL analysts and BI tools.
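
The notebook renames a single column by hand and keeps its capital letters (`State_Code`). If fully lowercase snake_case names were wanted across all columns, the mapping could be generated instead. The helper `to_snake_case` below is a hypothetical sketch, not part of the original pipeline.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a CamelCase or mixed column name to lowercase snake_case."""
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Replace runs of spaces or hyphens with one underscore, then lowercase
    return re.sub(r"[\s\-]+", "_", with_breaks).lower()


columns = ["StateCode", "State", "Region", "ingestion_data"]
rename_map = {c: to_snake_case(c) for c in columns}
print(rename_map)
```

The resulting dictionary could be passed to `regions.rename(columns=rename_map)`. Note that it yields `state_code` rather than the `State_Code` used in this notebook.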

---

### 5. Delta Lake Finalization
The processed data was converted back into a Spark DataFrame and written to a Delta Lake table.
* **Storage Format:** **Delta** (Enabling Time Travel and ACID transactions).
* **Write Strategy:** `.mode("overwrite")` combined with `.option("overwriteSchema", "true")`.
* **Destination:** `sales.silver_layer.regions`
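
Because the write overwrites the existing table, a few lightweight checks on the pandas DataFrame before the write may be worthwhile. The function `validate_regions` below is a hypothetical sketch, assuming `State_Code` is meant to be a unique, non-null, trimmed key. The original notebook does not include such a step.

```python
import pandas as pd


def validate_regions(df: pd.DataFrame, key: str = "State_Code") -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df[key].isna().any():
        problems.append(f"{key} contains missing values")
    if df[key].duplicated().any():
        problems.append(f"{key} contains repeated codes")
    # Compare each non-missing code with its trimmed form
    text = df[key].dropna().astype(str)
    if (text != text.str.strip()).any():
        problems.append(f"{key} contains untrimmed whitespace")
    return problems


# Illustrative frames, not taken from the real table
good = pd.DataFrame({"State_Code": ["AL", "AR", "AZ"]})
bad = pd.DataFrame({"State_Code": ["AL", "AL ", "AL", None]})
print(validate_regions(good))
print(validate_regions(bad))
```

A caller could raise an error or skip the write when the returned list is not empty.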



---

In [0]:
regions = regions.rename(columns = {
    'StateCode' : "State_Code"
})

regions_spark = spark.createDataFrame(regions)
regions_spark.write\
    .mode("overwrite")\
    .format("delta")\
    .option("overwriteSchema","true")\
    .saveAsTable("sales.silver_layer.regions")

## **Sanity Check of the Silver Table**

In [0]:
%sql
SELECT * FROM sales.silver_layer.regions LIMIT 10;

State_Code,State,Region,ingestion_data
AL,Alabama,South,2026-02-02T12:53:43.463Z
AR,Arkansas,South,2026-02-02T12:53:43.463Z
AZ,Arizona,West,2026-02-02T12:53:43.463Z
CA,California,West,2026-02-02T12:53:43.463Z
CO,Colorado,West,2026-02-02T12:53:43.463Z
CT,Connecticut,Northeast,2026-02-02T12:53:43.463Z
DC,District of Columbia,South,2026-02-02T12:53:43.463Z
DE,Delaware,South,2026-02-02T12:53:43.463Z
FL,Florida,South,2026-02-02T12:53:43.463Z
GA,Georgia,South,2026-02-02T12:53:43.463Z
