# Create Bronze Notebook - Data Ingestion

### CSIS4495-050: Applied Research Project

#### End-to-End Data Engineering Solution for HR Analytics

Group:
- Bruno do Nascimento Beserra
- Jay Clark Bermudez
- Matheus Filipe Figueiredo

Instructor: Dr. Bambang Sarif

<hr>

### Description:

This project simulates the evolution of a mid-sized company with 5,000 employees over seven years. To build the initial workforce, we drew a representative sample from a Kaggle dataset of employee information.

To showcase our data pipeline solution, built with state-of-the-art techniques, we designed a realistic simulation environment that captures key workforce dynamics over time. Throughout the seven-year period, employees may be promoted, change teams, or leave the company. In parallel, the company continuously hires new employees, whose records are drawn from the main dataset, to keep the workforce evolving.

In this notebook, we ingest raw CSV files stored in a structured folder hierarchy (`year/month/data.csv`) into a Delta table named **`hr_bronze_data`**. Each run adds an `ingestion_timestamp` column for traceability and records the ingested file paths in `metadata.txt`, so later runs process only files that have not been ingested yet.
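
The incremental idea can be sketched in plain Python, independent of Spark: compare the file paths discovered on disk with the paths already recorded, and keep only the difference. The helper name `select_new_files` and the example paths are hypothetical and serve only as an illustration.

```python
def select_new_files(discovered, already_ingested):
    """Return discovered paths that were not ingested yet, keeping discovery order."""
    seen = set(already_ingested)  # set lookup keeps the membership check cheap
    return [path for path in discovered if path not in seen]


# Hypothetical example paths following the year/month/data.csv layout
discovered = [
    "/data/2018/01/data.csv",
    "/data/2018/02/data.csv",
    "/data/2018/03/data.csv",
]
already_ingested = {"/data/2018/01/data.csv", "/data/2018/02/data.csv"}

pending = select_new_files(discovered, already_ingested)
print(pending)  # ['/data/2018/03/data.csv']
```

Running the same selection twice with an updated record of ingested paths returns an empty list, which is what makes repeated runs of the notebook safe.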

<hr>

### Step by Step:

- Import libraries and start the Spark session
- Define configurations
- Check the root folder for new files
- Read data
- Create the `ingestion_timestamp` column
- Save the Bronze table
- Update the metadata file



In [0]:
# Import Libraries and Start Spark Session
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
import os

spark = SparkSession.builder.getOrCreate()

In [0]:
# Define the dataset root, the metadata file that tracks ingested files, and the target Bronze table
base_path = "/Volumes/workspace/my_data/historical_data/unzipped/historical_data"

metadata_file = f"{base_path}/metadata.txt"
bronze_table = "workspace.applied_research_bronze.hr_bronze_data"

In [0]:
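# Load the paths of previously ingested files; start with an empty list
# if the metadata file cannot be read (for example, on the first run).
try:
    prev_files = (
        dbutils.fs.head(metadata_file, 1000000)  # reads at most 1,000,000 bytes
        .strip()
        .split("\n")
    )
except Exception:
    prev_files = []

# Drop blank lines and stray whitespace so paths compare exactly
ingested_files = {line.strip() for line in prev_files if line.strip()}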
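
The parsing step in the cell above can be checked without a Databricks cluster. The sketch below uses a hypothetical helper, `parse_manifest`, that turns the raw text of the metadata file into a set of paths. It assumes one path per line and tolerates blank lines, surrounding whitespace, and Windows line endings.

```python
def parse_manifest(text):
    """Return the set of non-empty, whitespace-trimmed lines in text."""
    return {line.strip() for line in text.splitlines() if line.strip()}


# Hypothetical file content with a blank line, padding, and a CRLF ending
raw = "/data/2018/01/data.csv\n\n  /data/2018/02/data.csv \r\n"
print(sorted(parse_manifest(raw)))
# ['/data/2018/01/data.csv', '/data/2018/02/data.csv']
```

An empty or missing file maps to an empty set, so the first run treats every discovered file as new.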

In [0]:
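def list_all_csv_files(base_path):
    """Return the paths of all CSV files under base_path/<year>/<month>/."""
    files = []
    for year_dir in dbutils.fs.ls(base_path):
        if not year_dir.isDir():
            continue  # skip loose files in the root, such as metadata.txt
        for month_dir in dbutils.fs.ls(year_dir.path):
            if not month_dir.isDir():
                continue
            for entry in dbutils.fs.ls(month_dir.path):
                if entry.path.endswith(".csv"):
                    files.append(entry.path)
    return files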

all_csv_files = list_all_csv_files(base_path)
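
For local testing outside Databricks, the same `year/month/*.csv` traversal can be approximated with the standard library. This is a sketch under the assumption that the files live on a regular filesystem. `list_csv_files_local` is a hypothetical name, and `pathlib` stands in for `dbutils.fs.ls`.

```python
import tempfile
from pathlib import Path


def list_csv_files_local(root):
    """Return sorted CSV paths exactly two levels below root (year/month/file.csv)."""
    return sorted(str(p) for p in Path(root).glob("*/*/*.csv") if p.is_file())


# Build a throwaway folder tree that mimics the expected layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ["2018/01/data.csv", "2018/02/data.csv", "2018/02/notes.txt"]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("id,name\n1,a\n")
    (root / "metadata.txt").write_text("")  # loose file in the root is ignored

    found = list_csv_files_local(root)
    relative = [Path(p).relative_to(root).as_posix() for p in found]

print(relative)  # ['2018/01/data.csv', '2018/02/data.csv']
```

The glob pattern only matches files two directory levels deep, so files placed directly in the root or in a year folder are not picked up.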

In [0]:
# Keep only the files that have not been ingested yet
new_files = [path for path in all_csv_files if path not in ingested_files]

if not new_files:
    print("No new files to ingest today.")
else:
    print(f"Found {len(new_files)} new files:")
    for nf in new_files:
        print(" -", nf)

In [0]:
if new_files:
    df_new = (
        spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(new_files)
    )
    if "ingestion_timestamp" not in df_new.columns:
        df_new = df_new.withColumn("ingestion_timestamp", f.current_timestamp())
    else:
        print("'ingestion_timestamp' already exists in the data — skipping column creation.")

    # Append to Bronze Delta table
    df_new.write.format("delta").mode("append").saveAsTable(bronze_table)

    # Update metadata file with the newly ingested file paths
    # (sorted so the file content is stable between runs)
    all_ingested = sorted(ingested_files.union(new_files))
    dbutils.fs.put(metadata_file, "\n".join(all_ingested), overwrite=True)

    print(f"Ingested {len(new_files)} new files and updated metadata.")
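
The metadata update at the end of the cell rewrites the full list of ingested paths on every run. A minimal sketch of that merge step in plain Python, using a hypothetical helper `build_manifest`, sorts the merged paths so the file content does not depend on set iteration order:

```python
def build_manifest(ingested, new_files):
    """Merge both path collections, drop duplicates, and join them one per line."""
    return "\n".join(sorted(set(ingested) | set(new_files)))


# Hypothetical paths; "a.csv" appears in both collections and is written once
manifest = build_manifest({"b.csv", "a.csv"}, ["c.csv", "a.csv"])
print(manifest.split("\n"))  # ['a.csv', 'b.csv', 'c.csv']
```

In the notebook, a string built this way could be passed to `dbutils.fs.put(metadata_file, ..., overwrite=True)` after the Delta append succeeds, so a failed write does not mark files as ingested.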