
feat(config): add deterministic fingerprint for workflow configs#587

Merged
nabinchha merged 8 commits into main from nmulepati/feat/584-config-fingerprint
Apr 30, 2026

Conversation


@nabinchha nabinchha commented Apr 30, 2026

📋 Summary

Adds DataDesignerConfig.fingerprint() — a deterministic, content-addressable sha256 over the data-relevant portion of a workflow config. Identical configs hash identically across processes and Python versions; runtime / environment / post-generation fields don't change the hash; alias-keyed lookup tables and None/[] representations are canonicalized so semantically-identical configs built different ways agree.

🔗 Related Issue

Closes #584

🔄 Changes

✨ Added

  • DataDesignerConfig.fingerprint() -> dict[str, str | int] — the public entry point. Returns {config_hash, config_hash_algo, config_hash_version}.
  • Internal data_designer.config.fingerprint module that powers the method (canonicalization + sha256). Not re-exported from data_designer.config — implementation detail that may change.
  • tests/config/test_fingerprint.py — 40 tests covering determinism (in-process + cross-process), include/exclude boundaries, alias-keyed canonicalization, None/[] equivalence, and custom-column identity.

🔧 Behavior details

Identity-relevant (changing flips the hash):

  • columns — names, types, generator params, processors, validators, skip/drop flags. Column order is part of identity (DAG ordering).
  • model_configs — alias, model, provider, sampling-relevant inference params (temperature, top_p, max_tokens, extra_body). Sorted by alias before hashing.
  • tool_configs — alias, providers, allow_tools, max_tool_call_turns. Sorted by tool_alias before hashing.
  • seed_config — source path / sampling strategy / selection strategy.
  • constraints, top-level processors.

Excluded (changing does not flip the hash):

  • profilers (post-generation analysis).
  • model_configs[*].skip_health_check.
  • inference_parameters.{max_parallel_requests, timeout}.
  • tool_configs[*].timeout_sec (per-call timing knob).
  • HuggingFace seed token and endpoint (auth + env, not data identity).

Canonicalization (so semantically-equivalent inputs hash equivalently):

  • model_configs and tool_configs are sorted by alias before hashing.
  • None and [] collapse to "absent" for top-level optional collections (model_configs, tool_configs, constraints, processors) and for tool_configs[*].allow_tools.
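Both rules can be sketched together in a few lines (`canonicalize` is an illustrative helper, not the shipped function; the key names match the lists above):

```python
_OPTIONAL_COLLECTIONS = ("model_configs", "tool_configs", "constraints", "processors")


def canonicalize(config: dict) -> dict:
    out = dict(config)
    # None and [] both mean "no entries": drop the key entirely so the
    # two spellings hash identically.
    for key in _OPTIONAL_COLLECTIONS:
        if key in out and not out[key]:
            del out[key]
    # Alias-keyed tables are lookup tables, not sequences: sort them so
    # builder-API and YAML-loaded orderings agree.
    if "model_configs" in out:
        out["model_configs"] = sorted(out["model_configs"], key=lambda m: m["alias"])
    if "tool_configs" in out:
        out["tool_configs"] = sorted(out["tool_configs"], key=lambda t: t["tool_alias"])
    return out
```

Note that columns are deliberately not sorted: column order is DAG order and therefore part of identity.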

Custom columns: the payload includes the function's __name__, __qualname__, __module__, generator_params, and the @custom_column_generator() decorator metadata (required_columns, side_effect_columns, model_aliases). Documented limitation: closures captured via factory functions share name/qualname/module/source and so fingerprint identically — fold captured state into generator_params if it's identity-relevant.

Versioned: config_hash_version = 1 is returned alongside the hash so future scheme changes can be detected as "unknown identity" rather than a definite mismatch.

Loud-failure determinism: no default= fallback in json.dumps — non-JSON-native values fail loudly instead of silently flipping the hash via repr().
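The failure mode is standard `json.dumps` behavior and can be shown directly, using `datetime` as the stand-in for any non-JSON-native value:

```python
import json
from datetime import datetime

payload = {"created_at": datetime(2026, 4, 30)}

# With a default= fallback (e.g. default=repr), the hash would silently
# depend on repr() output; with no default=, json.dumps refuses outright.
try:
    json.dumps(payload, sort_keys=True)
    raised = False
except TypeError:
    raised = True
assert raised
```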

🗑️ Removed (vs. earlier revisions of this PR)

  • The opt-in custom_column_source (L2) source-hashing path and its helpers. L2 had several silent footguns (closure-capture invisibility, L1≠L2 on configs with no custom columns, uncaught ValueError from inspect.unwrap() on cycles, same-__name__ collisions silently restored when getsource() failed). Strengthening L1 with qualname/module/decorator-metadata covers the realistic cases without the surprise.

🧪 Testing

  • make test-config passes (537 tests, 40 in test_fingerprint.py)
  • make test-engine passes (1917 tests)
  • make check-config clean (ruff format + lint)
  • Unit tests added/updated
  • E2E tests added/updated (N/A — pure config-layer addition)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) — N/A; behavior is documented in the module/method docstrings

Provides DataDesignerConfig.fingerprint() and a freestanding
fingerprint_config() helper that produce a content-addressable
sha256 hash of the data-relevant portion of a workflow config.
Identical configs hash identically across processes and Python
versions; fields that don't affect generated rows (tool_configs,
profilers, skip_health_check, max_parallel_requests, timeout,
HuggingFace seed token/endpoint) are excluded.

Custom column generators contribute their registered name and
generator_params (L1) by default; opt-in custom_column_source=True
also hashes inspect.getsource() of each generator (L2) and
degrades gracefully with a warning when the source can't be
retrieved. The normalization scheme is versioned via
CONFIG_HASH_VERSION so future changes can be detected as
"unknown identity" rather than mismatch.
…d seed strategies in fingerprint tests

Also document L1 __name__-collision and L2 whitespace-sensitivity limitations in fingerprint_config(), and drop the json.dumps default=str fallback so non-JSON-native values fail loudly instead of silently degrading determinism.
@nabinchha nabinchha requested a review from a team as a code owner April 30, 2026 17:25
@github-actions

Review of PR #587: feat(config): add deterministic fingerprint for workflow configs

Summary

Adds DataDesignerConfig.fingerprint() (and a freestanding fingerprint_config() helper) that returns a content-addressable sha256 over the data-relevant portion of a workflow config. Identity-relevant and excluded fields are explicitly enumerated; the scheme is versioned (CONFIG_HASH_VERSION = 1); custom columns have two tiers (L1 = __name__ + params, L2 = source-hashed). The module is wired through the lazy-import surface in data_designer.config, and the feature ships with 419 lines of focused unit tests covering shape, determinism (in-process and across processes), include boundaries, exclude boundaries, custom-column L1/L2, and graceful degradation when source is unavailable.

Net: the design is sound, the boundaries are well-chosen, and the test coverage is among the best I've seen for a config-layer addition. Below are a few small things worth addressing before merging.

Findings

Correctness / behavior

  • CustomColumnConfig.serialize_generator_params uses model_dump(), not model_dump(mode="json") (column_configs.py:715-719, pre-existing). fingerprint_config goes out of its way to avoid default= on json.dumps so non-JSON-native values fail loudly instead of silently drifting — good — but a user who puts a datetime, UUID, or non-str Enum into their custom-column generator_params will now hit a TypeError from json.dumps inside fingerprint(). Two options:

    1. Switch the serializer to v.model_dump(mode="json") (preferred; also makes to_dict() round-trip cleaner).
    2. Call it out in the fingerprint_config docstring under Limitations.

  The PR doesn't introduce this bug, but it's the first caller that will surface it.
  • L1 uses bare __name__, not __qualname__ / module path (column_configs.py:711-713, pre-existing). Documented clearly as a limitation — fine. Worth noting that __qualname__ is a cheap win over __name__ (catches class-nested generators, still stable across processes) and wouldn't require bumping to L2. Not blocking.

  • Canary / version bump signal. CONFIG_HASH_VERSION = 1 is load-bearing for downstream consumers, but nothing fails loudly if someone changes _normalize_config_dict without bumping it. Consider adding one regression test that pins the hex digest of a known fixture config — e.g.:

    def test_fingerprint_canary(stub_data_designer_config):
        # If this hash changes, bump CONFIG_HASH_VERSION and update this value.
        assert stub_data_designer_config.fingerprint()["config_hash"] == "sha256:<known>"

    The cross-process determinism test proves stability now; a canary proves stability across commits.

  • Local seed-source paths are in the hash (not excluded). Two machines with the same file at different absolute paths will fingerprint differently. That's the right default (path is the only proxy we have for file identity), but it's worth a one-liner under Limitations in the docstring so users aren't surprised.

Code quality / style

  • _make_minimal_config(**overrides: object) in the test file — Any would match the project's convention for kwargs of heterogeneous type (AGENTS.md calls for modern typed code, and the rest of the test file uses Any). Trivial.

  • In test_fingerprint.py, the _hash wrapper does str(fingerprint_config(...)["config_hash"]). The str(...) is only there to coerce the str | int union returned by the typed dict — works, but a TypedDict on the return type of fingerprint_config would let callers access ["config_hash"] as str directly and drop the cast. Optional; the current typing is fine.

  • fingerprint.py comments are appropriately sparse and explain why (e.g., "No default= fallback" rationale on the json.dumps call). Matches STYLEGUIDE.md.

Imports / structural invariants

  • data_designer_config.py now eagerly imports fingerprint, which eagerly imports CustomColumnConfig from column_configs. column_configs is already loaded transitively through data_designer_config's own imports, so no new heavy dependency is pulled into the import graph. The new fingerprint symbols in __init__.py are correctly routed through the lazy-import surface for top-level consumers. Direction interface → engine → config is respected.

  • from __future__ import annotations present in both new files. SPDX headers present. Typed throughout.

Testing

  • Coverage is strong: each excluded field has a negative test (change doesn't affect hash), each included field has a positive test (change does affect hash), column reorder is tested, HF seed token/endpoint are tested against a real HuggingFaceSeedSource, L1 vs L2 are distinguished, and the OSError degradation path is exercised with caplog.

  • test_fingerprint_deterministic_across_processes uses subprocess.run([sys.executable, "-c", script]) — good choice for proving cross-process stability, but note it adds a few seconds to the test suite and will fail if the project isn't installed into the venv that sys.executable points at. In the agentic-CI environment make install-dev handles this; fine for now.

  • No E2E test, which the PR description correctly flags as N/A.
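The shape of that cross-process test can be sketched without the project installed, hashing a toy payload in a fresh interpreter and comparing with the in-process digest (illustrative only; the real test round-trips a full config):

```python
import hashlib
import json
import subprocess
import sys

# Run the same deterministic hash in a child interpreter.
SCRIPT = (
    "import hashlib, json; "
    "print(hashlib.sha256(json.dumps({'a': 1}, sort_keys=True)"
    ".encode()).hexdigest())"
)
result = subprocess.run(
    [sys.executable, "-c", SCRIPT], capture_output=True, text=True, check=True
)
local = hashlib.sha256(json.dumps({"a": 1}, sort_keys=True).encode()).hexdigest()

# A repr()-with-memory-address regression would break this equality.
assert result.stdout.strip() == local
```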

Security

  • No secret handling beyond excluding token / endpoint from the hash, which is correct (don't let auth material change the fingerprint). Tokens are not logged or included in warnings. No obvious injection/deserialization risk — json.dumps with a plain dict tree is safe.

Verdict

Approve with minor comments. The core design is right (data-identity vs runtime-identity split, versioned scheme, loud failure on non-JSON values, graceful L2 degradation) and the tests back it up thoroughly. Before merging I'd suggest:

  1. Decide whether to fix CustomColumnConfig.serialize_generator_params to use mode="json" or document the non-JSON-native limitation in the fingerprint docstring.
  2. Add a canary digest test so future normalization changes force an explicit CONFIG_HASH_VERSION bump.
  3. Briefly document the local-seed-path sensitivity in the Limitations block.

None of these are blocking; all are small additions to an otherwise well-scoped change.


greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR adds DataDesignerConfig.fingerprint(), a deterministic SHA-256 content-addressable hash over the data-relevant portion of a workflow config. The implementation is well-reasoned: excluded fields are clearly documented, model_configs/tool_configs are sorted by alias for order-independence, and None/[] optional collections are canonicalized to "absent".

Confidence Score: 5/5

Safe to merge; the only finding is a P2 canonicalization gap for list ordering within ToolConfig.

All findings are P2 — a canonicalization inconsistency (unsorted allow_tools/providers) that does not cause crashes or incorrect data generation, only potentially different hashes for semantically-equivalent tool configs. No P0 or P1 issues found.

fingerprint.py — specifically _normalize_tool_config missing sorted() for providers and allow_tools.

Important Files Changed

Filename Overview
packages/data-designer-config/src/data_designer/config/fingerprint.py New fingerprinting module — well-structured canonicalization with one gap: allow_tools and providers lists inside ToolConfig are not sorted, so list-order differences produce different hashes for semantically-identical configs.
packages/data-designer-config/src/data_designer/config/data_designer_config.py Adds fingerprint() method that delegates to fingerprint_config; straightforward and correct.
packages/data-designer-config/tests/config/test_fingerprint.py 35 thorough tests covering determinism, include/exclude boundaries, canonicalization, and custom-column identity; no test for allow_tools/providers list-order invariance, which matches the gap in the implementation.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["DataDesignerConfig.fingerprint()"] --> B["config.to_dict()"]
    B --> C["_normalize_config_dict"]
    C --> C1["Drop profilers and runtime fields"]
    C --> C2["Collapse None and empty-list to absent"]
    C --> C3["Enrich custom columns\nwith qualname, module, metadata"]
    C --> C4["Normalize and sort model_configs by alias"]
    C --> C5["Normalize and sort tool_configs by alias"]
    C --> C6["Normalize seed_config\nremove auth fields"]
    C1 & C2 & C3 & C4 & C5 & C6 --> D["Canonical JSON dump"]
    D --> E["SHA-256 hex digest"]
    E --> F["config_hash + algo + version dict"]
```
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/data-designer-config/src/data_designer/config/fingerprint.py:185-187
**`allow_tools` and `providers` not sorted before hashing**

`ToolConfig.providers` and `allow_tools` are `list[str]` fields whose semantics are order-independent (a set of MCP providers / permitted tool names), but they are serialized as-is into the hash payload. Two configs that are otherwise identical except for list ordering — e.g. `providers=["p1","p2"]` vs `providers=["p2","p1"]`, or `allow_tools=["search","list"]` vs `allow_tools=["list","search"]` — will produce different fingerprints despite representing the same data identity.

The PR already sorts `model_configs` by `alias` and `tool_configs` by `tool_alias` for exactly this reason; the same treatment is missing for these inner string lists.

```python
def _normalize_tool_config(tool_config: dict[str, Any]) -> dict[str, Any]:
    normalized = _drop_keys(tool_config, _EXCLUDED_TOOL_CONFIG_KEYS)
    normalized = _drop_empty_optional(normalized, _TOOL_CONFIG_OPTIONAL_COLLECTIONS)
    if isinstance(normalized.get("providers"), list):
        normalized["providers"] = sorted(normalized["providers"])
    if isinstance(normalized.get("allow_tools"), list):
        normalized["allow_tools"] = sorted(normalized["allow_tools"])
    return normalized
```


The set of MCP tools an LLM column can call (providers, allow_tools,
max_tool_call_turns, tool_alias) shapes what the model produces, so
tool_configs is identity-relevant. Only timeout_sec is excluded,
mirroring how inference_parameters.timeout is treated as a runtime
knob rather than a data-identity field.

Updates the fingerprint_config docstring's Include/Exclude lists,
flips the existing tool_configs exclusion test, and adds coverage
for tool_alias / providers / allow_tools / max_tool_call_turns
inclusion plus timeout_sec exclusion.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor

@johnnygreco johnnygreco left a comment


Thanks for putting this together, @nabinchha — the include/exclude split, cross-process determinism test, and the L1/L2 framing for custom columns all read very cleanly.

Summary

Adds DataDesignerConfig.fingerprint() and a freestanding fingerprint_config() helper that produce a content-addressable sha256 over the data-relevant portion of a workflow config, with a versioned scheme and an opt-in custom_column_source=True (L2) mode that additionally hashes generator source. Implementation matches the PR description and the include/exclude boundaries described in the docstring.

Headline findings (inline comments below for the details)

Warnings — worth addressing before merge

  • L1 and L2 produce different hashes even when the config has no custom columns.
  • inspect.unwrap() cycle propagates as ValueError instead of degrading gracefully.
  • When L2 source retrieval fails, two distinct generators with the same __name__ collide silently.
  • Repeated WARNING on every fingerprint call (log-spam in hot paths).
  • None vs [] for optional top-level fields produces different hashes.

Suggestions — take or leave

  • Add the unhashable-source-falls-back-to-L1 caveat to the L2 limitations block.
  • Centralize the "what's excluded and why" so the five frozensets and the docstring prose can't drift.
  • Tighten test_custom_column_unhashable_source_degrades_gracefully to also assert hash stability across calls.

What Looks Good

  • Cross-process determinism test via subprocess.run is the right shape — would catch a repr()-with-memory-address regression immediately. Combined with the deliberate no-default= in json.dumps, loud-failure determinism is the right call.
  • INCLUDE/EXCLUDE test layout is clean and will scale as the exclusion list grows.
  • Versioning the scheme via CONFIG_HASH_VERSION upfront is exactly right — future scheme changes diagnose as "unknown identity" instead of silent mismatch.
  • Lazy-import wiring in config/__init__.py follows the existing pattern.

Verdict

Needs changes — the L1/L2-divergence and ValueError gaps are documented-vs-actual behavior mismatches worth fixing before merge. The None/[] canonicalization is a smaller footgun but easy to fix in the same pass.


This review was generated by an AI assistant.

Drops the opt-in `custom_column_source` (L2) source-hashing path and
addresses the canonicalization gaps the reviewers found.

L2 had several silent footguns: closures with different captured state
collapsed to the same hash, the empty `custom_column_sources: []` payload
key made L1 and L2 disagree even on configs with no custom columns,
`inspect.unwrap()` could raise `ValueError` on `__wrapped__` cycles
(uncaught), and same-`__name__` collisions silently came back when
`getsource()` failed. Removing it shrinks the public surface, deletes
~50 lines of helper code, and resolves seven review comments at once.

Strengthens L1 identity for custom columns: the payload now includes
`__qualname__`, `__module__`, and the `@custom_column_generator()`
decorator metadata (`required_columns`, `side_effect_columns`,
`model_aliases`) in addition to `__name__` + `generator_params`. This
disambiguates same-`__name__`-different-scope collisions and prevents
silently dropping DAG-affecting metadata.

Canonicalizes alias-keyed lookup tables and optional collections so
builder-API and YAML-loaded configs producing identical datasets
fingerprint identically:

* `model_configs` and `tool_configs` are sorted by alias before
  hashing (column order remains identity, since columns are DAG nodes).
* `None` and `[]` collapse to "absent" for top-level optional
  collections (`model_configs`, `tool_configs`, `constraints`,
  `processors`) and for `tool_configs[*].allow_tools`.

Consolidates the excluded-fields constants behind a single canonical
table comment and drops the Sphinx `:func:`/`:class:` roles in the
docstrings to match the rest of the codebase.

Test coverage adds order-independence tests for `model_configs` and
`tool_configs`, parametrized `None`-vs-`[]` equivalence tests for
all four optional top-level collections plus `allow_tools`,
qualname-disambiguation, and decorator-metadata change detection.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor
Function names should be action words; `_hash` is a noun. Rename the
test-only helper to `_compute_hash` to match its verb-form behavior
(it computes a hash from a config). No behavioral change.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor
@johnnygreco

Independently re-verified the fixes at 2debe120 — 39/39 tests pass, ruff clean, and I re-ran each scenario from the original review (qualname disambiguation, decorator-metadata sensitivity, None-vs-[] collapse across all four optional collections + allow_tools, alias-keyed order independence). All resolved.

Dropping L2 entirely and strengthening L1 with __qualname__ + __module__ + decorator metadata is a cleaner answer than patching L2 piecewise — the public surface gets smaller and the strongest collision class (same-__name__-different-scope) is now caught without getsource().

Two non-blocking follow-ups:

  1. PR description is stale. It still describes fingerprint(*, custom_column_source=False) with an opt-in L2 mode; the shipped method is fingerprint(self) and L2 is gone. Greptile's P1 was actually correct at the moment it fired. Worth a quick edit to the PR body before merge so future readers don't go hunting for an L2 path that doesn't exist.

  2. Closure-capture limitation is documented but not tested. Verified locally — make_gen(2) and make_gen(7) still hash identically, which matches the docstring's "Limitation:" paragraph. A small regression test that pins the current behavior would lock the documented contract in both directions (so a future change either keeps the limitation or has to delete the doc paragraph). Could be as simple as:

    def test_closure_captured_state_is_a_known_limitation() -> None:
        """Documented limitation: closures with different captured state still hash identically.
        If this test ever flips, update the Limitation section in fingerprint_config()."""
        def make_gen(factor: int):
            @custom_column_generator()
            def gen(row, generator_params=None):
                return str(row.get("x", 0) * factor)
            return gen
        assert _compute_hash(_make_custom_config(make_gen(2))) == _compute_hash(_make_custom_config(make_gen(7)))

Neither is blocking from my side.


Generated by an AI assistant.

johnnygreco previously approved these changes Apr 30, 2026
The previous _hash -> _compute_hash blanket rename also caught the test
names that happen to end in "_hash()" (e.g. test_changing_X_changes_hash).
"hash" is a noun there — it describes what the test is about, not the
helper being called. Restore the original names; only the helper itself
stays renamed.

Add `test_closure_captured_state_is_a_known_limitation` per @johnnygreco's
approval follow-up: factory-built closures with different captured state
share __name__/__qualname__/__module__/source and so fingerprint
identically. Pin that behavior so a future change either keeps the
limitation or has to delete the matching docstring paragraph in lockstep.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor
@nabinchha
Contributor Author

Thanks for the re-verification, @johnnygreco — both points addressed in 180ecb0:

  1. PR description: it was actually already up to date as of 28ed8e69 — the body shows fingerprint() -> dict[str, str | int] (no custom_column_source kwarg) and there's an explicit "Removed (vs. earlier revisions of this PR)" section noting L2 was dropped. You were probably reading a cached render. I just bumped the test counts from 35 → 40 / 536 → 537 to reflect the new closure-capture pin (and the canonicalization tests added in 28ed8e69).
  2. Closure-capture regression test: added test_closure_captured_state_is_a_known_limitation in 180ecb0, modeled on your suggestion. It pins the documented behavior (make_factor_gen(2) and make_factor_gen(7) hash identically) so a future change can't silently flip it without forcing an explicit decision on whether to keep or remove the docstring caveat.

Also folded in a fix from the same commit: my earlier blanket _hash → _compute_hash rename caught 26 test names that happened to end in _hash() (e.g. test_changing_column_name_changes_hash). "hash" is a noun in those names, so the rename was wrong there. Restored the original test names; only the helper itself stays as _compute_hash.

@nabinchha nabinchha requested a review from johnnygreco April 30, 2026 19:55
@nabinchha nabinchha merged commit 81033e6 into main Apr 30, 2026
57 checks passed
@nabinchha nabinchha deleted the nmulepati/feat/584-config-fingerprint branch April 30, 2026 20:06

Development

Successfully merging this pull request may close these issues.

Add a deterministic hash to uniquely identify a workflow config
