# Create Silver Notebook with Slowly Changing Dimension Type 2

### CSIS4495-050: Applied Research Project

#### End-to-End Data Engineering Solution for HR Analytics

Group:
- Bruno do Nascimento Beserra
- Jay Clark Bermudez
- Matheus Filipe Figueiredo

Instructor: Dr. Bambang Sarif

<hr>

### Description:

This project simulates the evolution of a mid-sized company with 5,000 employees over a period of seven years. To build the initial workforce, we used a Kaggle dataset containing employee information and extracted a representative sample to serve as the company’s employee base.

To showcase our data pipeline solution, built with state-of-the-art techniques, we designed a realistic simulation environment that captures key workforce dynamics over time. Throughout the seven-year period, employees may be promoted, change teams, or leave the company. In parallel, the company continuously hires new employees, drawn from the main dataset, to keep the workforce evolving.

In this notebook, we implement a Slowly Changing Dimension (SCD) Type 2. Rather than overwriting a record when it changes, the table keeps one row per version of an employee record, each with its own validity period. This lets us track historical changes while keeping the stored data compact, which is especially valuable when working with large datasets.
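
To make the mechanics concrete, here is a minimal plain-Python sketch of the Type 2 idea, independent of Spark. The function `apply_scd2`, the sample rows, and the `OPEN_END` sentinel are hypothetical and only illustrate the behaviour; the metadata field names mirror the columns created later in this notebook.

```python
from datetime import datetime

# Sentinel meaning "this version is still valid" (assumption for this sketch)
OPEN_END = datetime(9999, 12, 31, 23, 59, 59)


def apply_scd2(history, incoming, key, tracked, load_ts):
    """Return a new history list after applying one batch of incoming rows."""
    result = [dict(row) for row in history]
    current = {row[key]: row for row in result if row["is_current"]}

    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # nothing changed: keep the existing version
        if old is not None:
            # Close the previous version instead of overwriting it
            old["is_current"] = False
            old["end_effectivity_date"] = load_ts
        version = dict(
            new,
            start_effectivity_date=load_ts,
            end_effectivity_date=OPEN_END,
            is_current=True,
        )
        result.append(version)
        current[new[key]] = version
    return result


# An employee is hired as Junior, then promoted to Senior a year later
history = apply_scd2([], [{"employee_id": 1, "job_level": "Junior"}],
                     "employee_id", ["job_level"], datetime(2020, 1, 1))
history = apply_scd2(history, [{"employee_id": 1, "job_level": "Senior"}],
                     "employee_id", ["job_level"], datetime(2021, 1, 1))
```

After the second call, `history` holds two rows for the same employee: the closed Junior version and the current Senior version. Re-applying an unchanged row adds nothing.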

<hr>

### Step by Step:

- Import Libraries and Dataset
- Define Configurations
- Read the Bronze table according to the data already available in Silver
- Data Cleaning
- Develop the Slowly Changing Dimension Type 2 Technique (In Progress)
- Save Silver Table


In [0]:
# Import Libraries and Start Spark Session
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
import os

spark = SparkSession.builder.getOrCreate()

In [0]:
bronze_table = "workspace.applied_research_bronze.hr_bronze_data"
silver_table = "workspace.applied_research_silver.hr_silver_data"

In [0]:
silver_exists = spark.catalog.tableExists(silver_table)
print(silver_exists)

In [0]:
# Check last ingestion date from silver table
last_ingestion_timestamp = None

if silver_exists:
    last_timestamp_df = spark.table(silver_table).select(f.max("ingestion_timestamp").alias("last_timestamp"))
    last_ingestion_timestamp = last_timestamp_df.collect()[0]["last_timestamp"]
print(last_ingestion_timestamp)

In [0]:
if last_ingestion_timestamp is not None:
    df = spark.sql(f"""
    SELECT *
    FROM {bronze_table}
    WHERE ingestion_timestamp > '{last_ingestion_timestamp}'
    """)
else:
    df = spark.table(bronze_table)

In [0]:
df.count()

In [0]:
# Check Schema of the dataset
df.printSchema()

In [0]:
# Drop auxiliary columns created while generating the snapshots
df = df.drop("snapshot_date", "time_in_company", "previous_job_level", "last_raise_year", "month", "promotion_count")

In [0]:
# Rename columns following snake_case structure
df = df.withColumnsRenamed(
    {
        "Employee_ID": "employee_id",
        "Full_Name": "full_name",
        "Department": "department",
        "Job_Title": "job_title",
        "Hire_Date": "hire_date",
        "Location": "location",
        "Performance_Rating": "performance_rating",
        "Experience_Years": "experience_years",
        "Status": "status",
        "Work_Mode": "work_mode",
        "Annual_Salary": "annual_salary",
        "Job_Level": "job_level"
    })
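
Every target name in the mapping above is the lowercase form of its source name, so the dictionary could also be derived instead of typed by hand. The helper `snake_case_mapping` below is a hypothetical sketch in plain Python; it assumes the Bronze column names already use underscores between words, as they do in the mapping above.

```python
def snake_case_mapping(columns):
    """Map each column name to its lowercase form, skipping names that are already lowercase."""
    return {c: c.lower() for c in columns if c != c.lower()}


# Sample column names taken from this notebook
mapping = snake_case_mapping(["Employee_ID", "Full_Name", "ingestion_timestamp"])
print(mapping)
```

In the notebook, the result of `snake_case_mapping(df.columns)` could be passed to `df.withColumnsRenamed(...)` in place of the literal dictionary.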

In [0]:
# Fix column order in the output
df = df.select(
    "employee_id",
    "full_name",
    "department",
    "job_title",
    "hire_date",
    "location",
    "performance_rating",
    "experience_years",
    "status",
    "work_mode",
    "annual_salary",
    "job_level",
    "ingestion_timestamp"
)

## SCD Type 2

In [0]:
tracked_columns = ["employee_id", "full_name", "department", "job_title", "location", "performance_rating", "status", "job_level"]

In [0]:
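# Create metadata for SCD type 2
# data_hash: fingerprint of the tracked columns, used to detect changed records
# start_effectivity_date: moment this version was ingested
# end_effectivity_date: far-future sentinel for open-ended rows (the cast to date drops the time part)
# is_current: flags the latest version of each record
df = df.withColumn("data_hash", f.sha2(f.concat_ws("_", *tracked_columns), 256)) \
    .withColumn("start_effectivity_date", f.col("ingestion_timestamp")) \
    .withColumn("end_effectivity_date", f.lit("9999-12-31 23:59:59").cast("date")) \
    .withColumn("is_current", f.lit(True))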
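
One detail of the hash worth checking: `concat_ws` skips null values, so two rows whose nulls sit in different tracked columns can produce the same concatenated string and therefore the same hash. Whether the tracked columns contain nulls in this dataset is not established here. The plain-Python sketch below only emulates that behaviour with `hashlib`; `concat_ws_hash`, `null_safe_hash`, and the `<NULL>` token are hypothetical names chosen for illustration.

```python
import hashlib


def concat_ws_hash(values, sep="_"):
    """Emulate sha2(concat_ws(sep, ...), 256): None values are skipped."""
    joined = sep.join(str(v) for v in values if v is not None)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def null_safe_hash(values, sep="||", null_token="<NULL>"):
    """Replace None with an explicit token so each column keeps its position."""
    joined = sep.join(null_token if v is None else str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Same values, but the missing field is in a different position
row_a = ["Sales", None, "Vancouver"]
row_b = ["Sales", "Vancouver", None]

collision = concat_ws_hash(row_a) == concat_ws_hash(row_b)
safe_differs = null_safe_hash(row_a) != null_safe_hash(row_b)
print(collision, safe_differs)
```

In PySpark the same idea could be expressed by wrapping each tracked column in `f.coalesce(f.col(c).cast("string"), f.lit("<NULL>"))` before passing it to `f.concat_ws`.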

In [0]:
# Preview the first rows as a DataFrame
display(df.limit(5))

In [0]:
if silver_exists:
    # Append only new data
    df.write.format("delta").mode("append").saveAsTable(silver_table)
    print("Appended new data to existing Silver table.")
else:
    # Create the Silver table
    df.write.format("delta").mode("overwrite").saveAsTable(silver_table)
    print("Created new Silver table.")
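
The write above appends every incoming row with `is_current = True` and does not yet close earlier versions, which matches the SCD step being marked "In Progress". One possible way to expire superseded rows is a Delta Lake `MERGE` executed before the append. The sketch below only builds the SQL text; `build_expire_sql` and the view name `hr_updates` are hypothetical, and it assumes each batch holds at most one row per `employee_id`, since `MERGE` fails when several source rows match the same target row.

```python
def build_expire_sql(target_table, source_view):
    """Build a Delta MERGE that closes current rows whose hash has changed."""
    return f"""
MERGE INTO {target_table} AS t
USING {source_view} AS s
ON t.employee_id = s.employee_id AND t.is_current = true
WHEN MATCHED AND t.data_hash <> s.data_hash THEN
  UPDATE SET
    t.is_current = false,
    t.end_effectivity_date = CAST(s.start_effectivity_date AS DATE)
""".strip()


expire_sql = build_expire_sql(
    "workspace.applied_research_silver.hr_silver_data",
    "hr_updates",  # hypothetical temp view, e.g. df.createOrReplaceTempView("hr_updates")
)
print(expire_sql)
```

In the notebook this statement could be run with `spark.sql(expire_sql)` when `silver_exists` is true. The appended DataFrame would also need to be limited to new employees and rows whose hash differs from the current version, otherwise unchanged employees would end up with duplicate current rows.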