%md
# 🥈 **Silver Layer: Data Cleaning & Standardization**


The **Silver Layer** represents the "Cleaned" and "Validated" state of our data. In this stage, we transform the raw Bronze data into a queryable format by applying a strict sequence of cleaning and casting operations.

# 🚀 ETL Transformation: Sales Teams Silver Layer
**Objective:** Clean, type-cast, and deduplicate sales team data to ensure accurate relationship mapping in the Silver Layer.

### ✅ Summary of Transformation Steps
| Step | Operation | Tool |
| :--- | :--- | :--- |
| **1** | Trim Whitespace | Pandas `.str.strip()` |
| **2** | Cast to Integer | Pandas `.astype("Int64")` |
| **3** | Rename Columns | Pandas `.rename()` |
| **4** | Remove Duplicates | Pandas `.drop_duplicates()` |
| **5** | Save to Silver Layer | PySpark `.saveAsTable()` (Delta Lake) |

---

### 1. Data Ingestion & Conversion
The process started by loading the raw `sales_teams` data from the Bronze Layer. The data was then converted to a Pandas DataFrame so that string cleaning and type casting could be done with Pandas operations.
* **Source Table:** `sales.bronz_layer.sales_teams`
* **Method:** `toPandas()` collects the whole table onto the driver node, which is acceptable here because the table is small (30 rows).



---

In [0]:
import pandas as pd 
import pyspark.sql.functions as f 


sales_teams = spark.read.table("sales.bronz_layer.sales_teams")
sales_teams = sales_teams.toPandas()
print(type(sales_teams))
sales_teams.head()


<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,_SalesTeamID,Sales_Team,Region,ingestion_data
0,1.0,Adam Hernandez,Northeast,2026-02-02 12:53:46.115364
1,2.0,Keith Griffin,Northeast,2026-02-02 12:53:46.115364
2,3.0,Jerry Green,West,2026-02-02 12:53:46.115364
3,4.0,Chris Armstrong,Northeast,2026-02-02 12:53:46.115364
4,5.0,Stephen Payne,South,2026-02-02 12:53:46.115364


### 2. String Standardization & Trimming
To prevent join issues caused by hidden spaces (common in human-entered names and regions), all text columns underwent a cleaning process.
* **Operation:** Identified all `object` type columns and applied `.str.strip()`.
* **Impact:** Removed leading and trailing whitespace, so team names and region identifiers compare equal wherever they refer to the same value.
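
To illustrate why the trim matters for joins, here is a minimal sketch with made-up data. The `regions` lookup table and its `Region_Code` column are hypothetical and not part of this pipeline; the text columns are listed explicitly instead of being detected by dtype.

```python
import pandas as pd

# Hypothetical toy data: one region value carries a trailing space.
teams = pd.DataFrame({
    "Sales_Team": ["Adam Hernandez", "Jerry Green"],
    "Region": ["Northeast ", "West"],
})
# Hypothetical lookup table, used only to demonstrate the join.
regions = pd.DataFrame({
    "Region": ["Northeast", "West"],
    "Region_Code": ["NE", "W"],
})

# Before trimming, "Northeast " does not match "Northeast".
before = teams.merge(regions, on="Region", how="inner")

# Trim the text columns, mirroring the step described above.
cleaned = teams.copy()
for col in ["Sales_Team", "Region"]:
    cleaned[col] = cleaned[col].str.strip()

after = cleaned.merge(regions, on="Region", how="inner")
print(f"rows matched before trim: {len(before)}, after trim: {len(after)}")
```

The untrimmed join silently drops the affected row instead of raising an error, which is why this kind of issue tends to surface only later as missing records.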

---

In [0]:
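# Select the text columns (stored with the `object` dtype in this DataFrame)
string_columns = sales_teams.select_dtypes(include=['object']).columns
for col in string_columns:
    # Strip without casting to str first, so missing values stay missing
    # instead of turning into literal strings such as "None" or "nan"
    sales_teams[col] = sales_teams[col].str.strip()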

print(f"Trimmed {len(string_columns)} columns : {list(string_columns)} ")

Trimmed 2 columns : ['Sales_Team', 'Region'] 


### 3. Data Type Optimization & Renaming
The primary key, which arrived from Bronze as a float (`1.0`, `2.0`, ...), was converted to an integer type and renamed.
* **Type Casting:** Converted `_SalesTeamID` to `Int64`. The nullable integer type was used because a cast to plain `int64` raises an error when the column contains null values.
* **Renaming:** Standardized the column name from `_SalesTeamID` to `Sales_Team_ID` to follow the Silver Layer naming convention.
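
The difference between the two integer types can be seen in a small sketch. The series below is made-up data with one missing ID, an assumption for illustration only, since the actual table has no nulls in this column.

```python
import pandas as pd

# Hypothetical IDs as they might arrive from Bronze: floats, one missing.
ids = pd.Series([1.0, 2.0, None], name="_SalesTeamID")

# NumPy-backed int64 cannot represent a missing value, so the cast fails.
try:
    ids.astype("int64")
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The nullable extension type keeps the gap as <NA> and the rest as integers.
nullable_ids = ids.astype("Int64")

print(f"plain int64 cast succeeded: {plain_cast_ok}")
print(nullable_ids)
```

Keeping the missing value as `<NA>` means a later validation step can still detect it, which would not be possible if it had been replaced by a placeholder such as `0`.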



---

In [0]:
# Cast the sales team ID to a nullable integer
sales_teams['_SalesTeamID'] = sales_teams['_SalesTeamID'].astype("Int64")

# Rename the key column to follow the Silver Layer naming convention
sales_teams = sales_teams.rename(columns={'_SalesTeamID': 'Sales_Team_ID'})

print("information after data type conversion : ")
sales_teams.info()

information after data type conversion : 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Sales_Team_ID   30 non-null     Int64         
 1   Sales_Team      30 non-null     object        
 2   Region          30 non-null     object        
 3   ingestion_data  30 non-null     datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(2)
memory usage: 1.1+ KB


### 4. Deduplication & Data Integrity
Duplicate team records can inflate sales figures when the table is joined to order data, because each matching order is repeated once per duplicate. A deduplication step was performed to ensure each sales team appears only once.
* **Check:** Used `.duplicated().sum()` to count rows that are identical across all columns.
* **Action:** Removed duplicates while keeping the `first` occurrence and reset the index to maintain a clean DataFrame structure.
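
Because `.duplicated()` with no arguments compares entire rows, two records that share an ID but differ in another column are not flagged. The sketch below uses made-up rows to show the difference; checking the key with `subset=` is a suggested extra check, not something the pipeline above does.

```python
import pandas as pd

# Hypothetical rows: the last row repeats ID 1 with a different region.
df = pd.DataFrame({
    "Sales_Team_ID": [1, 2, 2, 1],
    "Sales_Team": ["Adam Hernandez", "Keith Griffin",
                   "Keith Griffin", "Adam Hernandez"],
    "Region": ["Northeast", "Northeast", "Northeast", "West"],
})

# Full-row comparison only flags the exact copy of the Keith Griffin row.
full_row_dupes = int(df.duplicated().sum())

# Key-based comparison also flags the conflicting record for ID 1.
key_dupes = int(df.duplicated(subset=["Sales_Team_ID"]).sum())

# Same deduplication as in the step above: exact copies are removed,
# but the conflicting ID survives and needs a separate decision.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
key_still_repeated = not deduped["Sales_Team_ID"].is_unique

print(full_row_dupes, key_dupes, key_still_repeated)
```

How to resolve a conflicting key (keep the latest record, reject the batch, and so on) is a business decision, so the sketch only detects the situation.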

---

In [0]:
print(f"the length of dataframe :  {len(sales_teams)}")
duplicated_count = sales_teams.duplicated().sum()  # count rows identical across all columns
print(f"found {duplicated_count} duplicated rows ")
# Remove duplicates, keeping the first occurrence
sales_teams = sales_teams.drop_duplicates(keep = 'first')
sales_teams = sales_teams.reset_index(drop = True)
print(f"length after removing duplicates {len(sales_teams)}" )

the length of dataframe :  30
found 2 duplicated rows 
length after removing duplicates 28


### 5. Final Delta Table Commitment
The processed data was converted back into a Spark DataFrame and written to a Delta table for downstream consumption.
* **Format:** **Delta** (Ensuring ACID compliance and data reliability).
* **Write Strategy:** `overwrite` mode with `overwriteSchema` enabled to accommodate the renamed columns.
* **Target Table:** `sales.silver_layer.sales_teams`
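
Since `overwrite` replaces the previous contents of the table, it can be worth checking the DataFrame before the write. The helper below is a hypothetical sketch (the name `check_sales_teams` and its rules are assumptions, not part of the original notebook) that verifies the expected columns and a non-null, unique key on the Pandas side.

```python
import pandas as pd

def check_sales_teams(df):
    """Return a list of problems; an empty list means the frame looks ready."""
    expected = ["Sales_Team_ID", "Sales_Team", "Region", "ingestion_data"]
    missing = [c for c in expected if c not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]

    problems = []
    if df["Sales_Team_ID"].isna().any():
        problems.append("null Sales_Team_ID")
    if not df["Sales_Team_ID"].is_unique:
        problems.append("duplicate Sales_Team_ID")
    return problems

# Small frame shaped like the cleaned table, using two rows from the preview.
ready = pd.DataFrame({
    "Sales_Team_ID": pd.array([1, 2], dtype="Int64"),
    "Sales_Team": ["Adam Hernandez", "Keith Griffin"],
    "Region": ["Northeast", "Northeast"],
    "ingestion_data": pd.to_datetime(["2026-02-02 12:53:46",
                                      "2026-02-02 12:53:46"]),
})

print(check_sales_teams(ready))
```

In the notebook, a call such as `check_sales_teams(sales_teams)` could run just before `spark.createDataFrame`, with the write skipped or the job failed if the returned list is not empty.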



---

In [0]:
# Convert back to Spark and save the table to the Silver Layer
sales_teams_spark = spark.createDataFrame(sales_teams)
(
    sales_teams_spark.write
    .mode("overwrite")
    .format("delta")
    .option("overwriteSchema", "true")
    .saveAsTable("sales.silver_layer.sales_teams")
)

### Sanity Check for the Silver Table

In [0]:
%sql 
SELECT * FROM sales.silver_layer.sales_teams LIMIT 5;

Sales_Team_ID,Sales_Team,Region,ingestion_data
1,Adam Hernandez,Northeast,2026-02-02T12:53:46.115Z
2,Keith Griffin,Northeast,2026-02-02T12:53:46.115Z
3,Jerry Green,West,2026-02-02T12:53:46.115Z
4,Chris Armstrong,Northeast,2026-02-02T12:53:46.115Z
5,Stephen Payne,South,2026-02-02T12:53:46.115Z
