
## Silver Orchestration & Data Quality Process

1. **Load Bronze Data**
   - Read raw data from the bronze table.

2. **Run Silver Pipeline**
   - Apply transformations and data quality (DQ) checks using the silver orchestrator.

3. **Write Silver Table**
   - Save the processed data to the silver layer.

4. **Generate Data Quality Reports**
   - Produce profiling and DQ metrics for both bronze and silver data.

5. **Display Profiling Results**
   - Visualize DQ metrics and profiling summaries.


In [0]:
dbutils.widgets.text("env", "dev")
env = dbutils.widgets.get("env")
catalog = f"supply_{env}"
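
Because the catalog name is built directly from the widget value, a mistyped environment points the notebook at a catalog that may not exist. A minimal sketch of a guard, assuming the valid environments are `dev`, `test`, and `prod` (this set and the helper name `resolve_catalog` are hypothetical, not taken from the project):

```python
# Hypothetical set of environments; adjust to the ones your workspace actually uses.
ALLOWED_ENVS = {"dev", "test", "prod"}


def resolve_catalog(env: str, prefix: str = "supply") -> str:
    """Return the catalog name for an environment, rejecting unknown values."""
    normalized = env.strip().lower()
    if normalized not in ALLOWED_ENVS:
        raise ValueError(
            f"Unknown env {env!r}; expected one of {sorted(ALLOWED_ENVS)}"
        )
    return f"{prefix}_{normalized}"
```

In the notebook this could replace the f-string, for example `catalog = resolve_catalog(dbutils.widgets.get("env"))`.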

In [0]:
df_bronze = spark.table(f"{catalog}.bronze.makeup_supply_chain_raw")
display(df_bronze.limit(10))

In [0]:
(df_bronze.count(), len(df_bronze.columns))

## Product & Sales Information
| Field                        | Description                                     |
| ---------------------------- | ----------------------------------------------- |
| **Product Type**             | Category or type of product in the supply chain |
| **SKU (Stock Keeping Unit)** | Unique identifier for each product              |
| **Price**                    | Selling price of the product                    |
| **Number of Products Sold**  | Quantity of units sold in a given period        |
| **Revenue Generated**        | Revenue from product sales                      |

## Customer Information
| Field                     | Description                                            |
| ------------------------- | ------------------------------------------------------ |
| **Customer Demographics** | Customer characteristics (age, gender, location, etc.) |

## Inventory & Stock
| Field            | Description                 |
| ---------------- | --------------------------- |
| **Availability** | Product availability status |
| **Stock Levels** | Quantity currently in stock |

## Orders & Shipping
| Field                    | Description                                 |
| ------------------------ | ------------------------------------------- |
| **Order Quantities**     | Number of units in each order               |
| **Shipping Times**       | Time taken to deliver products              |
| **Shipping Carriers**    | Carrier or service responsible for shipment |
| **Shipping Costs**       | Cost associated with shipping               |
| **Transportation Modes** | Mode of transport (air, sea, land)          |
| **Routes**               | Shipping paths used for delivery            |

## Suppliers & Manufacturing
| Field                       | Description                                   |
| --------------------------- | --------------------------------------------- |
| **Supplier Name**           | Vendor providing the product/material         |
| **Location**                | Warehouse, supplier, or distribution location |
| **Lead Time**               | Time required to receive goods from supplier  |
| **Production Volumes**      | Units produced in a given period              |
| **Manufacturing Lead Time** | Time required to manufacture a product        |
| **Manufacturing Costs**     | Costs associated with production              |

## Quality & Inspection
| Field                  | Description                               |
| ---------------------- | ----------------------------------------- |
| **Inspection Results** | Outcome of quality checks                 |
| **Defect Rates**       | Percentage or count of defective products |

## General Cost Information
| Field     | Description                                      |
| --------- | ------------------------------------------------ |
| **Costs** | Operational costs across supply chain activities |


In [0]:
from utils.config_loader import load_config
from utils.dq_reporting import profiling_report_to_df
from pipelines.silver_orchestrator import run_silver_pipeline

In [0]:
# Workaround for Databricks Repos import issue on serverless compute
# Load utility functions by executing the Python files directly

# Databricks Repos on serverless compute can have import path issues due to how the workspace file system is mounted.
# Even if the utils folder exists, standard Python imports may fail because /Workspace/Repos is not on sys.path by default,
# or because of differences in how serverless clusters handle file system mounts and isolation.
# Using exec to load the modules directly from their file paths is a workaround for these import issues.
# import sys
# sys.path.insert(0, "/Workspace/Repos/adm-shah@sitstest.org/databricks-lab")

# # Use exec to load the modules
# with open("/Workspace/Repos/adm-shah@sitstest.org/databricks-lab/utils/dq_profiling.py") as f:
#     exec(f.read(), globals())

# with open("/Workspace/Repos/adm-shah@sitstest.org/databricks-lab/utils/dq_reporting.py") as f:
#     exec(f.read(), globals())

# with open("/Workspace/Repos/adm-shah@sitstest.org/databricks-lab/utils/config_loader.py") as f:
#     exec(f.read(), globals())
# with open("/Workspace/Repos/adm-shah@sitstest.org/databricks-lab/utils/silver_structural.py") as f:
#     exec(f.read(), globals())

# print("✓ Utility functions loaded")
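
An alternative to `exec(f.read(), globals())` is to load each file as a proper module with the standard library's `importlib`, which keeps the loaded names in their own namespace instead of the notebook's globals. A minimal sketch; the helper name `load_module_from_path` is hypothetical, and whether this resolves the serverless import issue in a given workspace has not been verified here:

```python
import importlib.util
import sys
from pathlib import Path


def load_module_from_path(module_name: str, file_path: str):
    """Load a single .py file as a module without relying on sys.path."""
    path = Path(file_path)
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so later imports of module_name reuse this module.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

A call would look like `dq_reporting = load_module_from_path("dq_reporting", "<repo root>/utils/dq_reporting.py")`, where `<repo root>` stands for the workspace path of the repo. If a loaded file itself imports from `utils`, the repo root still has to be on `sys.path`.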

We apply the silver structure after running the data quality checks.

In [0]:
nb_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()

silver_config = load_config("transactions", "silver", nb_path)
dq_config = load_config("transactions", "dq", nb_path)

result = run_silver_pipeline(df_bronze, silver_config, dq_config)

silver_df = result["silver_df"]

silver_df.write.format("delta").mode("overwrite").saveAsTable(f"{catalog}.silver.transactions")


In [0]:
bronze_report = result["bronze_report"]
silver_report = result["silver_report"]

In [0]:
metrics_df = profiling_report_to_df(spark, "transactions", bronze_report)
display(metrics_df)
print("\n===== DATA QUALITY REPORT =====\n")

for key, value in silver_report.items():
    print(f"\n--- {key.upper()} ---")

    if hasattr(value, "show"):   # it's a Spark DataFrame
        value.show(truncate=False)

    elif isinstance(value, dict):  # it's a metrics dict (like duplicates)
        for k, v in value.items():
            print(f"{k}: {v}")

    else:
        print(value)
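
The loop above prints each report entry as it goes. If the scalar and dictionary metrics should be shown as a table instead, they can first be flattened into rows. A minimal sketch, assuming the report has the shape the loop implies (values are Spark DataFrames, metric dicts, or scalars); the helper name `flatten_report` is hypothetical:

```python
def flatten_report(report: dict):
    """Split a report into flat (section, metric, value) rows and DataFrame-like entries."""
    rows = []
    frames = {}
    for section, value in report.items():
        if hasattr(value, "show"):
            # DataFrame-like entry: keep it aside for separate display.
            frames[section] = value
        elif isinstance(value, dict):
            # Metrics dict: one row per metric.
            for metric, metric_value in value.items():
                rows.append((section, str(metric), str(metric_value)))
        else:
            # Scalar entry: a single row.
            rows.append((section, "value", str(value)))
    return rows, frames
```

When `rows` is non-empty, `spark.createDataFrame(rows, ["section", "metric", "value"])` turns it into a DataFrame that can be passed to `display`, and each entry in `frames` can be displayed on its own. Values are converted to strings so that mixed metric types fit in one column.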