# Load DQX Checks

## Reference

In [0]:
"""
# Runbook — `01_load_dqx_checks` (DQX Rule Loader)

**Purpose**  
Load YAML-defined DQX rules into the **checks config** table (Unity Catalog), with strict validation, canonicalization, and a stable `check_id` derived from a canonical JSON payload.

## Environment / Imports (exact sources)

- **Databricks Runtime** with `pyspark`
- **Package**: `databricks-labs-dqx==0.8.x`
- **Engine**: `databricks.labs.dqx.engine.DQEngine` (used for `validate_checks`)
- **Framework utils** (your repo, loaded via `add_src_to_sys_path()`):
  - `framework_utils.display`: `show_df`, `display_section`
  - `framework_utils.runtime`: `show_notebook_env`
  - `framework_utils.color`: `Color`
  - `framework_utils.console`: `Console`
  - `framework_utils.config`: `ProjectConfig`, `ConfigError`
  - `framework_utils.write`: `TableWriter`
  - `framework_utils.table`: `table_exists`
  - `framework_utils.path`: `dbfs_to_local`, `list_yaml_files`

> Note: `CHECKS_CONFIG_COMMENTS` is referenced when creating the table but is not defined in this notebook. Define or import it in the notebook scope before running; Python does not treat an undefined name as `None`.

## How to Run

```python
from framework_utils.config import ProjectConfig
spark.conf.set("spark.sql.session.timeZone", "UTC")
cfg = ProjectConfig("resources/dqx_config.yaml", variables={})
run_checks_loader(spark, cfg, notebook_idx=1, dry_run=False, validate_only=False)
```

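- **Dry run**: `dry_run=True` → builds the DataFrame and previews it; **no rows are written**
- **Validate only**: `validate_only=True` → runs file-level validation (`validate_rules_file`) on each YAML and returns the collected errors; **no rows are written**
- In both modes the target table is still created if it is missing, because table creation runs before either short-circuit.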

## Configuration (read from `ProjectConfig`)

- **Data source**: `notebooks.notebook_1.data_sources.data_source_1`
  - `source_path` (required): root folder for YAMLs (supports `dbfs:/` via `/dbfs` bridge)
  - `allowed_criticality` (required): allowed values for `criticality`
  - `required_fields` (required): required YAML fields
- **Target**: `notebooks.notebook_1.targets.target_table_1`
  - `full_table_name()` (FQN): UC table to write
  - `write`: `{format, mode, options}` (required)
  - `primary_key` (required): PK column name (the code supplies it to `create_table`)
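  - `partition_by`, `table_description`, `table_tags` (optional)
- **Variables**: `variables.apply_table_metadata` and `variables.batch_dedupe_mode` (both required; `batch_dedupe_mode` is one of `warn`, `error`, `skip`)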

## Data Model (written schema)

`CHECKS_CONFIG_STRUCT` (defined in this notebook) with fields:

- `check_id` (PK semantics): `sha256` of canonical payload
- `check_id_payload` (canonical JSON)
- `table_name`, `name`, `criticality`
- `check`: `{function, for_each_column?, arguments?}` (values stringified)
- `filter`, `run_config_name`, `user_metadata`
- `yaml_path`, `active`
- `created_by`, `created_at`, `updated_by`, `updated_at`

> Identity inputs for `check_id`: **`table_name` (lowercased), normalized `filter`, canonical `check`**.  
> **Not** included: `name`, `criticality`, `run_config_name`.
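
The identity rule above can be illustrated with a minimal, simplified sketch. It is not the loader's exact code: the loader's `_canon_check` additionally normalizes JSON-looking argument strings. The name `sketch_check_id`, the table `main.sales.orders`, and the sample filter are assumptions made up for this example.

```python
import hashlib
import json


def canon_filter(s):
    # Collapse runs of whitespace; a missing filter becomes the empty string.
    return "" if not s else " ".join(str(s).split())


def sketch_check_id(table_name, check, filter_str=None):
    # Simplified canonical form: sorted columns, string-valued arguments.
    fec = check.get("for_each_column")
    canon_check = {
        "function": check.get("function"),
        "for_each_column": sorted(str(c) for c in fec) if fec else None,
        "arguments": {str(k): str(v) for k, v in (check.get("arguments") or {}).items()},
    }
    payload = json.dumps(
        {"table_name": table_name.lower(), "filter": canon_filter(filter_str), "check": canon_check},
        sort_keys=True,
        separators=(",", ":"),
    )
    return payload, hashlib.sha256(payload.encode()).hexdigest()


# Two spellings of the same rule: different case, column order, and filter spacing.
rule_a = {"function": "is_not_null", "for_each_column": ["b", "a"]}
rule_b = {"function": "is_not_null", "for_each_column": ["a", "b"]}
_, id_a = sketch_check_id("Main.Sales.Orders", rule_a, "amount  >  0")
_, id_b = sketch_check_id("main.sales.orders", rule_b, "amount > 0")
print(id_a == id_b)  # True
```

Because `name`, `criticality`, and `run_config_name` never enter the payload, renaming a rule or changing its severity leaves `check_id` unchanged.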

## Processing Logic

1. **Discover YAMLs**: `list_yaml_files(source_path)` → show list (Spark DF).
2. **Validation (file)**:
   - Non-empty; no duplicate `name` within file.
3. **Validation (rule)**:
   - All `required_fields` present.
   - `table_name` is fully qualified (`catalog.schema.table`).
   - `criticality` ∈ `allowed_criticality`.
   - `check.function` present (non-empty).
4. **DQX validation**: `DQEngine.validate_checks(rules)` per file (raises on errors).
5. **Canonicalization**:
   - `filter` normalized whitespace.
   - `check.for_each_column` sorted (or `None`).
   - `check.arguments` stringified (JSON/bool/null/scalars).
   - Build payload JSON (`sort_keys=True`, compact separators), then `sha256`.
6. **Audit**: `created_by="AdminUser"`, `created_at=UTC now`.
7. **Batch de-dupe** (by `check_id`):
   - Keep lexicographically first by `(yaml_path, name)`.
   - Mode: **warn** (print), **error** (raise), **skip** (keep all).
8. **Write**:
   - If table missing → `TableWriter.create_table(...)` with `CHECKS_CONFIG_STRUCT`, PK, optional comments/tags.
   - Write via `TableWriter.write_df` (format/mode/options from target block).  
     DF columns are projected to target schema order.
9. **Diagnostics**:
   - Summary counts: total, distinct `check_id`, distinct `(check_id, run_config_name)`.
   - Write result: rows written, table FQN, mode.
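
The rule-level checks in step 3 can be sketched as a function that collects every problem before failing. This is a standalone illustration under assumed inputs: `rule_problems`, the sample rules, and the `required_fields` / `allowed_criticality` values are hypothetical, and the real values come from `ProjectConfig`.

```python
def rule_problems(rule, required_fields, allowed_criticality):
    # Collect every problem instead of stopping at the first one.
    problems = []
    for field in required_fields:
        if not rule.get(field):
            problems.append(f"missing required field '{field}'")
    # A fully qualified name has exactly two dots: catalog.schema.table.
    table_name = str(rule.get("table_name") or "")
    if table_name.count(".") != 2:
        problems.append(f"table_name '{table_name}' is not catalog.schema.table")
    if rule.get("criticality") not in set(allowed_criticality):
        problems.append(f"invalid criticality '{rule.get('criticality')}'")
    if not (rule.get("check") or {}).get("function"):
        problems.append("missing check.function")
    return problems


good = {"name": "orders_id_not_null", "table_name": "main.sales.orders",
        "criticality": "error", "check": {"function": "is_not_null"}}
bad = {"name": "broken", "table_name": "orders", "criticality": "fatal", "check": {}}

required = ["name", "table_name", "check"]
allowed = ["error", "warn"]
print(rule_problems(good, required, allowed))  # []
for problem in rule_problems(bad, required, allowed):
    print(problem)
```

Reporting all problems at once means a YAML author can fix a file in one pass rather than rerunning the loader after each error.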

## Failure Modes (and what the code does)

- **Missing config keys** → `must(...)` raises `ConfigError`.
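- **Empty YAML file** → skipped with a `Console.SKIP` message in a normal run; under `validate_only`, `validate_rules_file` raises and the message is collected in `errors`.
- **Malformed YAML** → `yaml.safe_load_all` raises `yaml.YAMLError` while the file is being loaded.
- **Missing or duplicate rule `name`** → `validate_rules_file` raises.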
- **Bad rule fields** → `validate_rule_fields` raises with details.
- **DQX schema/semantics error** → `validate_with_dqx` raises `ValueError` with `status.to_string()`.
- **Duplicate `check_id`**:
  - `mode="warn"` prints dropped block(s), continues.
  - `mode="error"` raises.
  - `mode="skip"` keeps all.
- **No rules after de-dupe** → short-circuit return (no write).
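
The three duplicate-handling modes, as documented above, can be sketched in a few lines. This is an illustration of the described behavior rather than the loader's own function; `dedupe_by_check_id` and the sample batch are hypothetical.

```python
def dedupe_by_check_id(rules, mode="warn"):
    # mode="skip" keeps every rule, duplicates included.
    if mode == "skip":
        return list(rules)
    groups = {}
    for rule in rules:
        groups.setdefault(rule["check_id"], []).append(rule)
    kept, dropped = [], []
    for members in groups.values():
        # The survivor is the lexicographically first (yaml_path, name) pair.
        members = sorted(members, key=lambda r: (r.get("yaml_path", ""), r.get("name", "")))
        kept.append(members[0])
        dropped.extend(members[1:])
    if dropped and mode == "error":
        raise ValueError(f"{len(dropped)} duplicate check_id value(s) in batch")
    if dropped and mode == "warn":
        print(f"dropped {len(dropped)} duplicate(s)")
    return kept


batch = [
    {"check_id": "abc", "yaml_path": "rules/b.yaml", "name": "second"},
    {"check_id": "abc", "yaml_path": "rules/a.yaml", "name": "first"},
    {"check_id": "def", "yaml_path": "rules/a.yaml", "name": "other"},
]
kept = dedupe_by_check_id(batch, mode="warn")
print([r["name"] for r in kept])  # ['first', 'other']
```

Sorting by `(yaml_path, name)` makes the surviving rule deterministic, so repeated runs over the same files keep the same row.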

## Post-Run sanity SQL (example)

```sql
SELECT COUNT(*) AS rules,
       COUNT(DISTINCT check_id) AS unique_rules
FROM <your_fqn>;

SELECT check_id, COUNT(*) c
FROM <your_fqn>
GROUP BY check_id
HAVING COUNT(*) > 1;  -- should return no rows when dedupe_mode != "skip"
```

## Notes

- Overwrite semantics: this is the **system of record** for rules derived from YAML; manual edits in the table will be lost on next run.  
- `check_id` identity includes only `{table_name↓, filter, check.*}` — **not** `name`, `criticality`, or `run_config_name`.  
- Argument values are persisted as strings for stability; cast/parse (e.g., `try_cast`, `from_json`) downstream.
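
As a rough illustration of the last note, the sketch below stringifies argument values using the documented rules and then parses them back with `json.loads`. The helper names and the sample arguments are assumptions for this example; a Spark consumer would use `from_json` or `try_cast` instead.

```python
import json


def stringify_value(v):
    # Documented rules: JSON for lists/dicts, lowercase booleans, "null" for None.
    if isinstance(v, (list, dict)):
        return json.dumps(v)
    if isinstance(v, bool):
        return "true" if v else "false"
    if v is None:
        return "null"
    return str(v)


def parse_value(s):
    # Best-effort inverse: valid JSON text becomes a Python value, anything else stays a string.
    try:
        return json.loads(s)
    except (TypeError, ValueError):
        return s


args = {"allowed": ["a", "b"], "strict": True, "limit": 10, "column": "status"}
stored = {k: stringify_value(v) for k, v in args.items()}
restored = {k: parse_value(v) for k, v in stored.items()}
print(stored)
print(restored == args)  # True
```

The round trip is lossy for strings that happen to look like JSON: an argument that was originally the string `"10"` comes back as the integer `10`, so consumers that care about the distinction need to know the expected type per argument.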
"""

In [0]:
"""
START: run_checks_loader(spark, cfg, *, notebook_idx, dry_run=False, validate_only=False)
|
|-- 0. Bootstrap
|     |-- add_src_to_sys_path(src_dir="src", sentinel="framework_utils")
|     |     └─ (defined inline in this notebook)
|     |-- show_notebook_env(spark)                     ── from framework_utils.runtime
|     |-- NOTE (in __main__): spark.sql.session.timeZone = "UTC"
|
|-- 1. Imports (explicit sources)
|     |-- pyspark.sql.{SparkSession, DataFrame, types as T, functions as F}
|     |-- databricks.labs.dqx.engine.DQEngine
|     |-- framework_utils.display.{show_df, display_section}
|     |-- framework_utils.runtime.show_notebook_env
|     |-- framework_utils.color.Color
|     |-- framework_utils.console.Console
|     |-- framework_utils.config.{ProjectConfig, ConfigError}
|     |-- framework_utils.write.TableWriter
|     |-- framework_utils.table.table_exists
|     |-- framework_utils.path.{dbfs_to_local, list_yaml_files}
|
|-- 2. Constants / schema
|     |-- CHECKS_CONFIG_STRUCT (StructType)            ── defined inline in this notebook
|     |-- (Optional) CHECKS_CONFIG_COMMENTS            ── referenced in create_table; must be defined in scope (may be None)
|
|-- 3. Utility functions (defined inline)
|     |-- must(val, name) → raise ConfigError on missing
|     |-- _canon_filter(s) → normalize whitespace
|     |-- _stringify_map_values(map) → stringify JSON/bool/null/scalars
|     |-- _canon_check(check) → {function, sorted for_each_column|None, arguments(sorted,stringified)}
|     |-- compute_check_id_payload(table_name, check, filter) → canonical JSON (lowercased table_name)
|     |-- compute_check_id_from_payload(payload) → sha256 hex
|     |-- load_yaml_rules(path) → list[dict] using yaml.safe_load_all + dbfs_to_local (framework_utils.path)
|     |-- validate_rules_file(rules, file_path) → nonempty + no duplicate rule names
|     |-- validate_rule_fields(rule, file_path, required_fields, allowed_criticality)
|     |-- validate_with_dqx(rules, file_path) → DQEngine.validate_checks(rules)
|     |-- process_yaml_file(path, required_fields, created_by_value, allowed_criticality)
|     |     ├─ load_yaml_rules + file/rule validation
|     |     ├─ payload/check_id build
|     |     └─ produce rows with audit fields (created_by="AdminUser", created_at=UTC now)
|     |-- dedupe_rules_in_batch_by_check_id(rules, mode) → group by check_id; keep first by (yaml_path, name);
|     |     prints/warns/errors via Console.* depending on mode
|     |-- discover_yaml(cfg, rules_dir)
|     |     ├─ list_yaml_files(rules_dir)               ── from framework_utils.path
|     |     └─ display_section + show_df of discovered files
|     |-- build_df_from_rules(spark, rules) → createDataFrame(rules, CHECKS_CONFIG_STRUCT)
|
|-- 4. Resolve config (ProjectConfig)
|     |-- nb = cfg.notebook(notebook_idx)
|     |-- ds = nb.data_sources().data_source(1)
|     |     ├─ rules_dir       = must(ds["source_path"])
|     |     ├─ allowed_crit    = must(ds["allowed_criticality"])
|     |     └─ required_fields = must(ds["required_fields"])
|     |-- t  = nb.targets().target_table(1)
|           ├─ fqn           = t.full_table_name()
|           ├─ write_block   = must(t["write"])          # {format, mode, options}
|           ├─ partition_by  = t.get("partition_by") or []
|           ├─ table_comment = t.get("table_description")
|           ├─ table_tags    = t.table_tags()
|           └─ primary_key   = must(t["primary_key"])
|
|-- 5. Ensure target table exists
|     |-- if not table_exists(spark, fqn):                ── framework_utils.table
|           └─ TableWriter.create_table(…)
|                fqn=fqn,
|                schema=CHECKS_CONFIG_STRUCT,
|                format=write_block["format"],
|                options=write_block.get("options") or {},
|                partition_by=partition_by,
|                table_comment=table_comment,
|                column_comments=(CHECKS_CONFIG_COMMENTS or None),
|                table_properties=None,
|                table_tags=table_tags,
|                column_tags=None,
|                primary_key_cols=[primary_key]
|
|-- 6. Discover YAMLs
|     |-- yaml_files = discover_yaml(cfg, rules_dir)      ── prints and shows files
|
|-- 7. validate_only short-circuit (if True)
|     |-- for p in yaml_files: validate_rules_file(load_yaml_rules(p), p)
|     |-- RETURN {config_path=cfg.path, rules_files=len(yaml_files), errors=[…]}
|
|-- 8. Load + validate + canonicalize + ID
|     |-- all_rules = []
|     |-- for p in yaml_files:
|           ├─ file_rules = process_yaml_file(p, required_fields, created_by_value="AdminUser", allowed_criticality=allowed_crit)
|           ├─ if file_rules: extend all_rules; print Console.LOADER summary
|     |-- pre_dedupe  = len(all_rules)
|     |-- rules       = dedupe_rules_in_batch_by_check_id(all_rules, mode=dedupe_mode)   # dedupe_mode from variables.batch_dedupe_mode
|     |-- post_dedupe = len(rules)
|     |-- if not rules: RETURN {config_path, rules_files, wrote_rows=0, target_table=fqn}
|
|-- 9. Build DataFrame + quick diagnostics
|     |-- df = build_df_from_rules(spark, rules)
|     |-- display_section("SUMMARY OF RULES LOADED FROM YAML")
|     |-- show_df(totals_df: [count, distinct check_id, distinct (check_id, run_config_name)])
|
|-- 10. dry_run short-circuit (if True)
|     |-- display_section("DRY-RUN: FULL RULES PREVIEW"); show_df(df.orderBy("table_name","name"))
|     |-- RETURN {metrics…, target_table=fqn, wrote_rows=0, write_mode=write_block["mode"]}
|
|-- 11. Write to target table (order aligned to target schema)
|     |-- tw.write_df(
|            df=df.select(*[f.name for f in spark.table(fqn).schema.fields]),
|            fqn=fqn,
|            mode=write_block["mode"],
|            format=write_block["format"],
|            options=write_block.get("options") or {},
|         )
|     |-- wrote_rows = df.count()
|     |-- display_section("WRITE RESULT"); show_df(summary rows, table, mode)
|     |-- print colored success line via framework_utils.color.Color
|
|-- 12. RETURN result dict
|     |-- {config_path, rules_files, rules_pre_dedupe, rules_post_dedupe,
|         unique_check_ids, distinct_rule_run_pairs, target_table=fqn,
|         wrote_rows, write_mode=write_block["mode"]}
|
END: run_checks_loader
"""

In [0]:
"""
# dq_{env}.dqx.checks_config
CHECKS_CONFIG_STRUCT = T.StructType([
    T.StructField("check_id",         T.StringType(),   False),
    T.StructField("check_id_payload", T.StringType(),   False),
    T.StructField("table_name",       T.StringType(),   False),
    T.StructField("name",             T.StringType(),   False),
    T.StructField("criticality",      T.StringType(),   False),
    T.StructField("check", T.StructType([
        T.StructField("function",        T.StringType(), False),
        T.StructField("for_each_column", T.ArrayType(T.StringType()), True),
        T.StructField("arguments",       T.MapType(T.StringType(), T.StringType()), True),
    ]), False),
    T.StructField("filter",           T.StringType(),   True),
    T.StructField("run_config_name",  T.StringType(),   False),
    T.StructField("user_metadata",    T.MapType(T.StringType(), T.StringType()), True),
    T.StructField("yaml_path",        T.StringType(),   False),
    T.StructField("active",           T.BooleanType(),  False),
    T.StructField("created_by",       T.StringType(),   False),
    T.StructField("created_at",       T.TimestampType(),False),
    T.StructField("updated_by",       T.StringType(),   True),
    T.StructField("updated_at",       T.TimestampType(),True),
])
"""

## Implementation

In [0]:
%pip install databricks-labs-dqx==0.8.0

In [0]:
dbutils.library.restartPython()

In [0]:
# Databricks notebook: 01_load_dqx_checks
# Purpose: Load YAML rules into dq_{env}.dqx.checks_config
# Requires: databricks-labs-dqx==0.8.x

from __future__ import annotations

import sys
import json
import yaml
import hashlib
from pathlib import Path
from datetime import datetime
from typing import Dict, Any, Optional, List, Tuple

from pyspark.sql import SparkSession, DataFrame, types as T, functions as F

from databricks.labs.dqx.engine import DQEngine

def add_src_to_sys_path(src_dir="src", sentinel="framework_utils", max_levels=12):
    start = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd().resolve()
    p = start
    for _ in range(max_levels):
        cand = p / src_dir
        if (cand / sentinel).exists():
            s = str(cand.resolve())
            if s not in sys.path:
                sys.path.insert(0, s)
                print(f"[bootstrap] sys.path[0] = {s}")
            return
        if p == p.parent: break
        p = p.parent
    raise ImportError(f"Couldn't find {src_dir}/{sentinel} above {start}")

add_src_to_sys_path()

from framework_utils.display import show_df, display_section
from framework_utils.runtime import show_notebook_env
from framework_utils.color import Color
from framework_utils.console import Console
from framework_utils.config import ProjectConfig, ConfigError
from framework_utils.write import TableWriter
from framework_utils.table import table_exists
from framework_utils.path import dbfs_to_local, list_yaml_files

# =========================
# SPARK STRUCTURED SCHEMA (source of truth)
# =========================
CHECKS_CONFIG_STRUCT = T.StructType([
    T.StructField("check_id",         T.StringType(),   False, {"comment": "PRIMARY KEY. Stable sha256 over canonical {table_name↓, filter, check.*}."}),
    T.StructField("check_id_payload", T.StringType(),   False, {"comment": "Canonical JSON used to derive `check_id` (sorted keys, normalized values)."}),
    T.StructField("table_name",       T.StringType(),   False, {"comment": "Target table FQN (`catalog.schema.table`). Lowercased in payload for stability."}),
    T.StructField("name",             T.StringType(),   False, {"comment": "Human-readable rule name. Used in UI/diagnostics and joins."}),
    T.StructField("criticality",      T.StringType(),   False, {"comment": "Rule severity: `error|warn`."}),
    T.StructField("check", T.StructType([
        T.StructField("function",        T.StringType(), False, {"comment": "DQX function to run"}),
        T.StructField("for_each_column", T.ArrayType(T.StringType()), True,  {"comment": "Optional list of columns"}),
        T.StructField("arguments",       T.MapType(T.StringType(), T.StringType()), True, {"comment": "Key/value args"}),
    ]), False, {"comment": "Structured rule `{function, for_each_column?, arguments?}`; values stringified."}),
    T.StructField("filter",           T.StringType(),   True,  {"comment": "Optional SQL predicate applied before evaluation (row-level)."}),
    T.StructField("run_config_name",  T.StringType(),   False, {"comment": "Execution group/tag. Not part of identity."}),
    T.StructField("user_metadata",    T.MapType(T.StringType(), T.StringType()), True, {"comment": "Free-form map<string,string>."}),
    T.StructField("yaml_path",        T.StringType(),   False, {"comment": "Absolute/volume path to the defining YAML doc (lineage)."}),
    T.StructField("active",           T.BooleanType(),  False, {"comment": "If `false`, rule is ignored by runners."}),
    T.StructField("created_by",       T.StringType(),   False, {"comment": "Audit: creator/principal that materialized the row."}),
    T.StructField("created_at",       T.TimestampType(),False, {"comment": "Audit: creation timestamp (UTC)."}),
    T.StructField("updated_by",       T.StringType(),   True,  {"comment": "Audit: last updater (nullable)."}),
    T.StructField("updated_at",       T.TimestampType(),True,  {"comment": "Audit: last update timestamp (UTC, nullable)."}),
])

# =========================
# Small local helpers (no dependency on must() inside utils.config)
# =========================
def must(val: Any, name: str) -> Any:
    if val is None or (isinstance(val, str) and not val.strip()):
        raise ConfigError(f"Missing required config: {name}")
    return val

# =========================
# Canonicalization & IDs
# =========================
def _canon_filter(s: Optional[str]) -> str:
    return "" if not s else " ".join(str(s).split())

def _stringify_map_values(d: Dict[str, Any]) -> Dict[str, str]:
    out: Dict[str, str] = {}
    for k, v in (d or {}).items():
        if isinstance(v, (list, dict)):
            out[k] = json.dumps(v)
        elif isinstance(v, bool):
            out[k] = "true" if v else "false"
        elif v is None:
            out[k] = "null"
        else:
            out[k] = str(v)
    return out

def _canon_check(chk: Dict[str, Any]) -> Dict[str, Any]:
    out = {"function": chk.get("function"), "for_each_column": None, "arguments": {}}
    fec = chk.get("for_each_column")
    if isinstance(fec, list):
        out["for_each_column"] = sorted([str(x) for x in fec]) or None
    args = chk.get("arguments") or {}
    canon_args: Dict[str, str] = {}
    for k, v in args.items():
        sv = "" if v is None else str(v).strip()
        if (sv.startswith("{") and sv.endswith("}")) or (sv.startswith("[") and sv.endswith("]")):
            try:
                sv = json.dumps(json.loads(sv), sort_keys=True, separators=(",", ":"))
            except Exception:
                pass
        canon_args[str(k)] = sv
    out["arguments"] = {k: canon_args[k] for k in sorted(canon_args)}
    return out

def compute_check_id_payload(table_name: str, check_dict: Dict[str, Any], filter_str: Optional[str]) -> str:
    payload_obj = {"table_name": (table_name or "").lower(), "filter": _canon_filter(filter_str), "check": _canon_check(check_dict or {})}
    return json.dumps(payload_obj, sort_keys=True, separators=(",", ":"))

def compute_check_id_from_payload(payload: str) -> str:
    return hashlib.sha256(payload.encode()).hexdigest()

# =========================
# Rule YAML load/validate
# =========================
def load_yaml_rules(path: str) -> List[dict]:
    p = Path(dbfs_to_local(path))
    if not p.exists():
        raise FileNotFoundError(f"Rules YAML not found: {p}")
    with open(p, "r") as fh:
        docs = list(yaml.safe_load_all(fh)) or []
    out: List[dict] = []
    for d in docs:
        if not d:
            continue
        if isinstance(d, dict):
            out.append(d)
        elif isinstance(d, list):
            out.extend([x for x in d if isinstance(x, dict)])
    return out

def validate_rules_file(rules: List[dict], file_path: str):
    if not rules:
        raise ValueError(f"No rules found in {file_path} (empty or invalid YAML).")
    probs, seen = [], set()
    for r in rules:
        nm = r.get("name")
        if not nm:
            # A nameless rule is reported once; it must not also count as a duplicate.
            probs.append(f"Missing rule name in {file_path}")
            continue
        if nm in seen:
            probs.append(f"Duplicate rule name '{nm}' in {file_path}")
        seen.add(nm)
    if probs:
        raise ValueError(f"File-level validation failed in {file_path}: {probs}")

def validate_rule_fields(rule: dict, file_path: str, required_fields: List[str], allowed_criticality: List[str]):
    probs = []
    for f in required_fields:
        if not rule.get(f): probs.append(f"Missing required field '{f}' in rule '{rule.get('name')}' ({file_path})")
    # Guard against keys that are present but null (or of the wrong type) in the YAML.
    table_name = str(rule.get("table_name") or "")
    if table_name.count(".") != 2:
        probs.append(f"table_name '{rule.get('table_name')}' not fully qualified in rule '{rule.get('name')}' ({file_path})")
    if rule.get("criticality") not in set(allowed_criticality):
        probs.append(f"Invalid criticality '{rule.get('criticality')}' in rule '{rule.get('name')}' ({file_path})")
    check = rule.get("check")
    if not isinstance(check, dict) or not check.get("function"):
        probs.append(f"Missing check.function in rule '{rule.get('name')}' ({file_path})")
    if probs:
        raise ValueError("Rule-level validation failed: " + "; ".join(probs))

def validate_with_dqx(rules: List[dict], file_path: str):
    status = DQEngine.validate_checks(rules)
    if getattr(status, "has_errors", False):
        raise ValueError(f"DQX validation failed in {file_path}:\n{status.to_string()}")

# =========================
# Build rows from YAML docs (UTC timestamps)
# =========================
def process_yaml_file(
    path: str,
    required_fields: List[str],
    created_by_value: str,
    allowed_criticality: List[str],
) -> List[dict]:
    docs = load_yaml_rules(path)
    if not docs:
        print(f"{Console.SKIP} {path} has no rules.")
        return []

    validate_rules_file(docs, path)
    flat: List[dict] = []

    for rule in docs:
        validate_rule_fields(rule, path, required_fields, allowed_criticality)
        raw_check = rule["check"] or {}
        payload   = compute_check_id_payload(rule["table_name"], raw_check, rule.get("filter"))
        check_id  = compute_check_id_from_payload(payload)

        function = raw_check.get("function")
        if not isinstance(function, str) or not function:
            raise ValueError(f"{path}: check.function must be a non-empty string (rule '{rule.get('name')}').")

        for_each = raw_check.get("for_each_column")
        if for_each is not None and not isinstance(for_each, list):
            raise ValueError(f"{path}: check.for_each_column must be an array of strings (rule '{rule.get('name')}').")
        for_each = [str(x) for x in (for_each or [])] or None

        arguments = raw_check.get("arguments", {}) or {}
        if not isinstance(arguments, dict):
            raise ValueError(f"{path}: check.arguments must be a map (rule '{rule.get('name')}').")
        arguments = _stringify_map_values(arguments)

        user_metadata = rule.get("user_metadata")
        if user_metadata is not None:
            if not isinstance(user_metadata, dict):
                raise ValueError(f"{path}: user_metadata must be a map (rule '{rule.get('name')}').")
            user_metadata = _stringify_map_values(user_metadata)

        created_at_ts = datetime.utcnow()

        flat.append({
            "check_id": check_id,
            "check_id_payload": payload,
            "table_name": rule["table_name"],
            "name": rule["name"],
            "criticality": rule["criticality"],
            "check": {"function": function, "for_each_column": for_each, "arguments": arguments or None},
            "filter": rule.get("filter"),
            "run_config_name": rule["run_config_name"],
            "user_metadata": user_metadata or None,
            "yaml_path": path,
            "active": bool(rule.get("active", True)),
            "created_by": created_by_value,  # supplied by the caller instead of a hardcoded literal
            "created_at": created_at_ts,
            "updated_by": None,
            "updated_at": None,
        })

    validate_with_dqx(docs, path)
    return flat

# =========================
# Dedupe (on check_id)
# =========================
def _fmt_rule_for_dup(r: dict) -> str:
    return f"name={r.get('name')} | file={r.get('yaml_path')} | criticality={r.get('criticality')} | run_config={r.get('run_config_name')} | filter={r.get('filter')}"

def dedupe_rules_in_batch_by_check_id(rules: List[dict], mode: str) -> List[dict]:
    # mode="skip" disables de-duplication entirely (documented as "keeps all").
    if mode == "skip":
        return list(rules)
    groups: Dict[str, List[dict]] = {}
    for r in rules:
        groups.setdefault(r["check_id"], []).append(r)
    out: List[dict] = []
    dropped = 0
    blocks: List[str] = []
    for cid, lst in groups.items():
        if len(lst) == 1:
            out.append(lst[0]); continue
        lst = sorted(lst, key=lambda x: (x.get("yaml_path",""), x.get("name","")))
        keep, dups = lst[0], lst[1:]; dropped += len(dups)
        head = f"{Console.DEDUPE} {len(dups)} duplicate(s) for check_id={cid[:12]}…"
        lines = ["    " + _fmt_rule_for_dup(x) for x in lst]
        tail = f"    -> keeping: name={keep.get('name')} | file={keep.get('yaml_path')}"
        blocks.append("\n".join([head, *lines, tail])); out.append(keep)
    if dropped:
        msg = "\n\n".join(blocks) + f"\n{Console.DEDUPE} total dropped={dropped}"
        if mode == "error": raise ValueError(msg)
        if mode == "warn": print(msg)
    return out

# =========================
# Discover rule YAMLs
# =========================
def discover_yaml(cfg: ProjectConfig, rules_dir: str) -> List[str]:
    print(f"{Console.DEBUG} rules_dir (raw from YAML): {rules_dir}")
    files = list_yaml_files(rules_dir)
    display_section("YAML FILES DISCOVERED (recursive)")
    df = SparkSession.builder.getOrCreate().createDataFrame([(p,) for p in files], "yaml_path string")
    show_df(df, n=500, truncate=False)
    return files

# =========================
# Build DF
# =========================
def build_df_from_rules(spark: SparkSession, rules: List[dict]) -> DataFrame:
    return spark.createDataFrame(rules, schema=CHECKS_CONFIG_STRUCT)

# =========================
# Runner
# =========================
def run_checks_loader(
    spark: SparkSession,
    cfg: ProjectConfig,
    *,
    notebook_idx: int,
    dry_run: bool = False,
    validate_only: bool = False,
) -> Dict[str, Any]:

    apply_meta  = bool(must(cfg.get("variables.apply_table_metadata"), "variables.apply_table_metadata"))
    dedupe_mode = must(cfg.get("variables.batch_dedupe_mode"), "variables.batch_dedupe_mode")

    nb = cfg.notebook(notebook_idx)

    # Use data_sources.data_source_1
    ds = nb.data_sources().data_source(1)
    rules_dir       = must(ds.get("source_path"),         f"notebooks.notebook_{notebook_idx}.data_sources.data_source_1.source_path")
    allowed_crit    = must(ds.get("allowed_criticality"), f"notebooks.notebook_{notebook_idx}.data_sources.data_source_1.allowed_criticality")
    required_fields = must(ds.get("required_fields"),     f"notebooks.notebook_{notebook_idx}.data_sources.data_source_1.required_fields")

    # Target (checks_config)
    t = nb.targets().target_table(1)
    fqn           = t.full_table_name()
    partition_by  = t.get("partition_by") or []
    write_block   = must(t.get("write"), f"{fqn}.write")
    table_comment = t.get("table_description")
    primary_key   = must(t.get("primary_key"), f"{fqn}.primary_key")   # from YAML
    table_tags    = t.table_tags()                                     # normalized in config.py

    tw = TableWriter(spark)

    # Create if needed (no unsupported create modes)
    if not table_exists(spark, fqn):
        tw.create_table(
            fqn=fqn,
            schema=CHECKS_CONFIG_STRUCT,
            format=must(write_block.get("format"), f"{fqn}.write.format"),
            options=write_block.get("options") or {},
            partition_by=partition_by,
            table_comment=table_comment,
            column_comments=(globals().get("CHECKS_CONFIG_COMMENTS") or None),  # avoids NameError when the constant is not defined
            table_properties=None,          # add if you map something to TBLPROPERTIES
            table_tags=table_tags,          # tags from YAML (flat dict)
            column_tags=None,               # per-column tags if you maintain them
            primary_key_cols=[primary_key], # PK columns only; name auto-generated
        )

    yaml_files = discover_yaml(cfg, rules_dir)

    if validate_only:
        print(f"{Console.VALIDATION} Validation only: not writing any rules.")
        errs: List[str] = []
        for p in yaml_files:
            try:
                validate_rules_file(load_yaml_rules(p), p)
            except Exception as e:
                errs.append(f"{p}: {e}")
        return {"config_path": cfg.path, "rules_files": len(yaml_files), "errors": errs}

    # Build rows (UTC created_at), dedupe by check_id
    all_rules: List[dict] = []
    for full_path in yaml_files:
        file_rules = process_yaml_file(
            full_path,
            required_fields=required_fields,
            created_by_value="AdminUser",
            allowed_criticality=allowed_crit,
        )
        if file_rules:
            all_rules.extend(file_rules)
            print(f"{Console.LOADER} {full_path}: rules={len(file_rules)}")

    pre_dedupe = len(all_rules)
    rules = dedupe_rules_in_batch_by_check_id(all_rules, mode=dedupe_mode)
    post_dedupe = len(rules)

    if not rules:
        print(f"{Console.SKIP} No rules discovered; nothing to do.")
        return {"config_path": cfg.path, "rules_files": len(yaml_files), "wrote_rows": 0, "target_table": fqn}

    print(f"{Console.DEDUPE} total parsed rules (pre-dedupe): {pre_dedupe}")
    df = build_df_from_rules(spark, rules)

    display_section("SUMMARY OF RULES LOADED FROM YAML")
    totals = [(df.count(), df.select("check_id").distinct().count(), df.select("check_id", "run_config_name").distinct().count())]
    tdf = df.sparkSession.createDataFrame(
        totals,
        schema="`total number of rules found` long, `unique rules found` long, `distinct pair of rules` long",
    )
    show_df(tdf, n=1)

    if dry_run:
        display_section("DRY-RUN: FULL RULES PREVIEW")
        show_df(df.orderBy("table_name", "name"), n=1000, truncate=False)
        return {
            "config_path": cfg.path,
            "rules_files": len(yaml_files),
            "rules_pre_dedupe": pre_dedupe,
            "rules_post_dedupe": post_dedupe,
            "unique_check_ids": df.select("check_id").distinct().count(),
            "distinct_rule_run_pairs": df.select("check_id","run_config_name").distinct().count(),
            "target_table": fqn,
            "wrote_rows": 0,
            "write_mode": must(write_block.get("mode"), f"{fqn}.write.mode"),
        }

    # Write (the projection aligns column order with the target table schema)
    tw.write_df(
        df=df.select(*[f.name for f in spark.table(fqn).schema.fields]),
        fqn=fqn,
        mode=must(write_block.get("mode"), f"{fqn}.write.mode"),
        format=must(write_block.get("format"), f"{fqn}.write.format"),
        options=write_block.get("options") or {},
    )

    wrote_rows = df.count()
    display_section("WRITE RESULT")
    summary = spark.createDataFrame([(wrote_rows, fqn, must(write_block.get("mode"), f"{fqn}.write.mode"))],
                                    schema="`rules written` long, `target table` string, `mode` string")
    show_df(summary, n=1)
    print(f"{Color.b}{Color.ivory}Finished writing rules to '{Color.r}{Color.b}{Color.i}{Color.sea_green}{fqn}{Color.r}{Color.b}{Color.ivory}'{Color.r}.")

    return {
        "config_path": cfg.path,
        "rules_files": len(yaml_files),
        "rules_pre_dedupe": pre_dedupe,
        "rules_post_dedupe": post_dedupe,
        "unique_check_ids": df.select("check_id").distinct().count(),
        "distinct_rule_run_pairs": df.select("check_id","run_config_name").distinct().count(),
        "target_table": fqn,
        "wrote_rows": wrote_rows,
        "write_mode": must(write_block.get("mode"), f"{fqn}.write.mode"),
    }


# -------------------------
# Entrypoint (local/dev)
# -------------------------
if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    show_notebook_env(spark)
    cfg = ProjectConfig("resources/dqx_config.yaml", variables={})
    result = run_checks_loader(spark, cfg, notebook_idx=1, dry_run=False, validate_only=False)
    print(result)