feat(config): add deterministic fingerprint for workflow configs #587
Conversation
Provides DataDesignerConfig.fingerprint() and a freestanding fingerprint_config() helper that produce a content-addressable sha256 hash of the data-relevant portion of a workflow config. Identical configs hash identically across processes and Python versions; fields that don't affect generated rows (tool_configs, profilers, skip_health_check, max_parallel_requests, timeout, HuggingFace seed token/endpoint) are excluded. Custom column generators contribute their registered name and generator_params (L1) by default; opt-in custom_column_source=True also hashes inspect.getsource() of each generator (L2) and degrades gracefully with a warning when the source can't be retrieved. The normalization scheme is versioned via CONFIG_HASH_VERSION so future changes can be detected as "unknown identity" rather than mismatch.
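The scheme described above can be sketched in a few lines. This is an illustrative simplification, not the actual `data_designer` implementation: the helper name, the excluded-key set, and the flat-dict normalization are all assumptions, and the real normalization is far richer (alias sorting, custom-column enrichment, etc.).

```python
import hashlib
import json

CONFIG_HASH_VERSION = 1  # bump whenever the normalization scheme changes
_EXCLUDED_KEYS = frozenset({"profilers", "skip_health_check"})  # illustrative subset

def fingerprint_config(config_dict: dict) -> dict:
    # Drop fields that don't affect generated rows.
    normalized = {k: v for k, v in config_dict.items() if k not in _EXCLUDED_KEYS}
    # sort_keys + fixed separators make the dump byte-identical across
    # processes and Python versions; no default= means non-JSON-native
    # values raise TypeError instead of silently degrading determinism.
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return {
        "config_hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "config_hash_algo": "sha256",
        "config_hash_version": CONFIG_HASH_VERSION,
    }
```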
…d seed strategies in fingerprint tests. Also document L1 `__name__`-collision and L2 whitespace-sensitivity limitations in `fingerprint_config()`, and drop the `json.dumps` `default=str` fallback so non-JSON-native values fail loudly instead of silently degrading determinism.
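The loud-failure behavior this commit describes can be seen directly with the standard library: without a `default=` fallback, `json.dumps` raises `TypeError` on a non-JSON-native value rather than silently serializing a `repr()` string (which could embed a memory address and flip the hash between runs).

```python
import json

class NotJsonNative:
    """Stands in for any value json cannot serialize natively."""

try:
    json.dumps({"value": NotJsonNative()}, sort_keys=True)
    raised = False
except TypeError:
    raised = True

# Fails loudly: the caller learns the config contains a non-deterministic
# value instead of getting a hash that silently varies across processes.
assert raised
```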
Review of PR #587 —
Greptile Summary — This PR adds a deterministic sha256 fingerprint for workflow configs.
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/fingerprint.py | New fingerprinting module — well-structured canonicalization with one gap: allow_tools and providers lists inside ToolConfig are not sorted, so list-order differences produce different hashes for semantically-identical configs. |
| packages/data-designer-config/src/data_designer/config/data_designer_config.py | Adds fingerprint() method that delegates to fingerprint_config; straightforward and correct. |
| packages/data-designer-config/tests/config/test_fingerprint.py | 35 thorough tests covering determinism, include/exclude boundaries, canonicalization, and custom-column identity; no test for allow_tools/providers list-order invariance, which matches the gap in the implementation. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["DataDesignerConfig.fingerprint()"] --> B["config.to_dict()"]
    B --> C["_normalize_config_dict"]
    C --> C1["Drop profilers and runtime fields"]
    C --> C2["Collapse None and empty-list to absent"]
    C --> C3["Enrich custom columns\nwith qualname, module, metadata"]
    C --> C4["Normalize and sort model_configs by alias"]
    C --> C5["Normalize and sort tool_configs by alias"]
    C --> C6["Normalize seed_config\nremove auth fields"]
    C1 & C2 & C3 & C4 & C5 & C6 --> D["Canonical JSON dump"]
    D --> E["SHA-256 hex digest"]
    E --> F["config_hash + algo + version dict"]
```
Prompt To Fix All With AI
Fix the following code review issue, proposing a concise fix.
---
### Issue 1 of 1
packages/data-designer-config/src/data_designer/config/fingerprint.py:185-187
**`allow_tools` and `providers` not sorted before hashing**
`ToolConfig.providers` and `allow_tools` are `list[str]` fields whose semantics are order-independent (a set of MCP providers / permitted tool names), but they are serialized as-is into the hash payload. Two configs that are otherwise identical except for list ordering — e.g. `providers=["p1","p2"]` vs `providers=["p2","p1"]`, or `allow_tools=["search","list"]` vs `allow_tools=["list","search"]` — will produce different fingerprints despite representing the same data identity.
The PR already sorts `model_configs` by `alias` and `tool_configs` by `tool_alias` for exactly this reason; the same treatment is missing for these inner string lists.
```python
def _normalize_tool_config(tool_config: dict[str, Any]) -> dict[str, Any]:
    normalized = _drop_keys(tool_config, _EXCLUDED_TOOL_CONFIG_KEYS)
    normalized = _drop_empty_optional(normalized, _TOOL_CONFIG_OPTIONAL_COLLECTIONS)
    if isinstance(normalized.get("providers"), list):
        normalized["providers"] = sorted(normalized["providers"])
    if isinstance(normalized.get("allow_tools"), list):
        normalized["allow_tools"] = sorted(normalized["allow_tools"])
    return normalized
```
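The suggested fix can be exercised with a simplified, self-contained version. The `_drop_keys`/`_drop_empty_optional` helpers from the real module are elided here; only the order-independence behavior is shown.

```python
from typing import Any

def normalize_tool_config(tool_config: dict[str, Any]) -> dict[str, Any]:
    """Sort order-independent string lists so list order can't flip the hash.

    Simplified sketch of the suggested fix: the real helper also drops
    excluded keys and empty optional collections.
    """
    normalized = dict(tool_config)
    for key in ("providers", "allow_tools"):
        if isinstance(normalized.get(key), list):
            normalized[key] = sorted(normalized[key])
    return normalized

a = normalize_tool_config(
    {"tool_alias": "t", "providers": ["p1", "p2"], "allow_tools": ["search", "list"]}
)
b = normalize_tool_config(
    {"tool_alias": "t", "providers": ["p2", "p1"], "allow_tools": ["list", "search"]}
)
# Semantically identical configs now normalize (and therefore hash) identically.
assert a == b
```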
Reviews (7): Last reviewed commit: "test(fingerprint): pin closure-capture l..."
The set of MCP tools an LLM column can call (providers, allow_tools, max_tool_call_turns, tool_alias) shapes what the model produces, so tool_configs is identity-relevant. Only timeout_sec is excluded, mirroring how inference_parameters.timeout is treated as a runtime knob rather than a data-identity field.

Updates the fingerprint_config docstring's Include/Exclude lists, flips the existing tool_configs exclusion test, and adds coverage for tool_alias / providers / allow_tools / max_tool_call_turns inclusion plus timeout_sec exclusion.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor
johnnygreco
left a comment
Thanks for putting this together, @nabinchha — the include/exclude split, cross-process determinism test, and the L1/L2 framing for custom columns all read very cleanly.
Summary
Adds DataDesignerConfig.fingerprint() and a freestanding fingerprint_config() helper that produce a content-addressable sha256 over the data-relevant portion of a workflow config, with a versioned scheme and an opt-in custom_column_source=True (L2) mode that additionally hashes generator source. Implementation matches the PR description and the include/exclude boundaries described in the docstring.
Headline findings (inline comments below for the details)
Warnings — worth addressing before merge
- L1 and L2 produce different hashes even when the config has no custom columns.
- `inspect.unwrap()` cycle propagates as `ValueError` instead of degrading gracefully.
- When L2 source retrieval fails, two distinct generators with the same `__name__` collide silently.
- Repeated WARNING on every fingerprint call (log spam in hot paths).
- `None` vs `[]` for optional top-level fields produces different hashes.
Suggestions — take or leave
- Add the unhashable-source-falls-back-to-L1 caveat to the L2 limitations block.
- Centralize the "what's excluded and why" so the five frozensets and the docstring prose can't drift.
- Tighten `test_custom_column_unhashable_source_degrades_gracefully` to also assert hash stability across calls.
What Looks Good
- Cross-process determinism test via `subprocess.run` is the right shape — would catch a `repr()`-with-memory-address regression immediately. Combined with the deliberate no-`default=` in `json.dumps`, loud-failure determinism is the right call.
- INCLUDE/EXCLUDE test layout is clean and will scale as the exclusion list grows.
- Versioning the scheme via `CONFIG_HASH_VERSION` upfront is exactly right — future scheme changes diagnose as "unknown identity" instead of silent mismatch.
- Lazy-import wiring in `config/__init__.py` follows the existing pattern.
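The "unknown identity" diagnosis praised above can be made concrete with a hypothetical consumer-side check. `compare_fingerprints` is not part of the PR; it is an illustration of how a caller could use the returned `config_hash_version` field.

```python
def compare_fingerprints(a: dict, b: dict) -> str:
    # A version mismatch means the two hashes were produced under different
    # normalization schemes: they are incomparable, so report "unknown"
    # rather than a definite mismatch.
    if a["config_hash_version"] != b["config_hash_version"]:
        return "unknown"
    return "match" if a["config_hash"] == b["config_hash"] else "mismatch"

assert compare_fingerprints(
    {"config_hash": "abc", "config_hash_version": 1},
    {"config_hash": "abc", "config_hash_version": 2},
) == "unknown"
assert compare_fingerprints(
    {"config_hash": "abc", "config_hash_version": 1},
    {"config_hash": "abc", "config_hash_version": 1},
) == "match"
```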
Verdict
Needs changes — the L1/L2-divergence and ValueError gaps are documented-vs-actual behavior mismatches worth fixing before merge. The None/[] canonicalization is a smaller footgun but easy to fix in the same pass.
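The `None`/`[]` footgun flagged in the verdict can be closed with a small collapse step before serialization. Helper and constant names below are assumed for illustration, not the actual implementation.

```python
from typing import Any

# Illustrative set of top-level optional collections (assumed from the review).
_OPTIONAL_COLLECTIONS = frozenset({"model_configs", "tool_configs", "constraints", "processors"})

def drop_empty_optional(
    config: dict[str, Any],
    optional_keys: frozenset[str] = _OPTIONAL_COLLECTIONS,
) -> dict[str, Any]:
    # Collapse None and [] to "absent" so all three spellings of an
    # empty optional collection serialize (and hash) identically.
    return {
        k: v
        for k, v in config.items()
        if not (k in optional_keys and (v is None or v == []))
    }

assert drop_empty_optional({"columns": [1], "constraints": None}) == {"columns": [1]}
assert drop_empty_optional({"columns": [1], "constraints": []}) == {"columns": [1]}
```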
This review was generated by an AI assistant.
Drops the opt-in `custom_column_source` (L2) source-hashing path and addresses the canonicalization gaps the reviewers found. L2 had several silent footguns: closures with different captured state collapsed to the same hash, the empty `custom_column_sources: []` payload key made L1 and L2 disagree even on configs with no custom columns, `inspect.unwrap()` could raise `ValueError` on `__wrapped__` cycles (uncaught), and same-`__name__` collisions silently came back when `getsource()` failed. Removing it shrinks the public surface, deletes ~50 lines of helper code, and resolves seven review comments at once.

Strengthens L1 identity for custom columns: the payload now includes `__qualname__`, `__module__`, and the `@custom_column_generator()` decorator metadata (`required_columns`, `side_effect_columns`, `model_aliases`) in addition to `__name__` + `generator_params`. This disambiguates same-`__name__`-different-scope collisions and prevents silently dropping DAG-affecting metadata.

Canonicalizes alias-keyed lookup tables and optional collections so builder-API and YAML-loaded configs producing identical datasets fingerprint identically:

* `model_configs` and `tool_configs` are sorted by alias before hashing (column order remains identity, since columns are DAG nodes).
* `None` and `[]` collapse to "absent" for top-level optional collections (`model_configs`, `tool_configs`, `constraints`, `processors`) and for `tool_configs[*].allow_tools`.

Consolidates the excluded-fields constants behind a single canonical table comment and drops the Sphinx `:func:`/`:class:` roles in the docstrings to match the rest of the codebase.

Test coverage adds order-independence tests for `model_configs` and `tool_configs`, parametrized `None`-vs-`[]` equivalence tests for all four optional top-level collections plus `allow_tools`, qualname-disambiguation, and decorator-metadata change detection.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor
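The strengthened L1 payload this commit describes can be sketched as follows. Field names and the helper itself are illustrative (the real payload is built inside the fingerprint module); the point is that `__qualname__`/`__module__` disambiguate same-`__name__` functions from different scopes.

```python
def custom_column_identity(fn, generator_params: dict, decorator_metadata: dict) -> dict:
    # Sketch of the L1 identity payload: __qualname__ and __module__
    # disambiguate same-__name__ functions defined in different scopes;
    # decorator metadata (required_columns, side_effect_columns,
    # model_aliases) is DAG-affecting and so identity-relevant.
    return {
        "name": fn.__name__,
        "qualname": fn.__qualname__,
        "module": fn.__module__,
        "generator_params": generator_params,
        "metadata": decorator_metadata,
    }

def make_generator():
    def gen(row):  # same __name__ as the module-level gen below
        return row
    return gen

def gen(row):
    return row

# Same __name__, different scope: qualnames differ, so the payloads differ.
nested = custom_column_identity(make_generator(), {}, {})
top_level = custom_column_identity(gen, {}, {})
assert nested["name"] == top_level["name"]
assert nested["qualname"] != top_level["qualname"]
```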
Function names should be action words; `_hash` is a noun. Rename the test-only helper to `_compute_hash` to match its verb-form behavior (it computes a hash from a config). No behavioral change.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor
Independently re-verified the fixes at … Dropping L2 entirely and strengthening L1 with … Two non-blocking follow-ups: …

Neither is blocking from my side. Generated by an AI assistant.
The previous `_hash` -> `_compute_hash` blanket rename also caught the test names that happen to end in "_hash()" (e.g. `test_changing_X_changes_hash`). "hash" is a noun there — it describes what the test is about, not the helper being called. Restore the original names; only the helper itself stays renamed.

Add `test_closure_captured_state_is_a_known_limitation` per @johnnygreco's approval follow-up: factory-built closures with different captured state share `__name__`/`__qualname__`/`__module__`/source and so fingerprint identically. Pin that behavior so a future change either keeps the limitation or has to delete the matching docstring paragraph in lockstep.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Made-with: Cursor
Thanks for the re-verification, @johnnygreco — both points addressed in 180ecb0:
Also folded in a fix from the same commit: my earlier …
📋 Summary
Adds `DataDesignerConfig.fingerprint()` — a deterministic, content-addressable sha256 over the data-relevant portion of a workflow config. Identical configs hash identically across processes and Python versions; runtime / environment / post-generation fields don't change the hash; alias-keyed lookup tables and `None`/`[]` representations are canonicalized so semantically-identical configs built different ways agree.

🔗 Related Issue
Closes #584
🔄 Changes
✨ Added
- `DataDesignerConfig.fingerprint() -> dict[str, str | int]` — the public entry point. Returns `{config_hash, config_hash_algo, config_hash_version}`.
- `data_designer.config.fingerprint` module that powers the method (canonicalization + sha256). Not re-exported from `data_designer.config` — an implementation detail that may change.
- `tests/config/test_fingerprint.py` — 40 tests covering determinism (in-process + cross-process), include/exclude boundaries, alias-keyed canonicalization, `None`/`[]` equivalence, and custom-column identity.

🔧 Behavior details
Identity-relevant (changing flips the hash):
- `columns` — names, types, generator params, processors, validators, skip/drop flags. Column order is part of identity (DAG ordering).
- `model_configs` — alias, model, provider, sampling-relevant inference params (temperature, top_p, max_tokens, extra_body). Sorted by alias before hashing.
- `tool_configs` — alias, providers, allow_tools, max_tool_call_turns. Sorted by tool_alias before hashing.
- `seed_config` — source path / sampling strategy / selection strategy.
- `constraints`, top-level `processors`.

Excluded (changing does not flip the hash):

- `profilers` (post-generation analysis).
- `model_configs[*].skip_health_check`.
- `inference_parameters.{max_parallel_requests, timeout}`.
- `tool_configs[*].timeout_sec` (per-call timing knob).
- HuggingFace seed `token` and `endpoint` (auth + env, not data identity).

Canonicalization (so semantically-equivalent inputs hash equivalently):

- `model_configs` and `tool_configs` are sorted by alias before hashing.
- `None` and `[]` collapse to "absent" for top-level optional collections (`model_configs`, `tool_configs`, `constraints`, `processors`) and for `tool_configs[*].allow_tools`.

Custom columns: the payload includes the function's `__name__`, `__qualname__`, `__module__`, `generator_params`, and the `@custom_column_generator()` decorator metadata (`required_columns`, `side_effect_columns`, `model_aliases`). Documented limitation: closures captured via factory functions share name/qualname/module/source and so fingerprint identically — fold captured state into `generator_params` if it's identity-relevant.

Versioned: `config_hash_version = 1` is returned alongside the hash so future scheme changes can be detected as "unknown identity" rather than a definite mismatch.

Loud-failure determinism: no `default=` fallback in `json.dumps` — non-JSON-native values fail loudly instead of silently flipping the hash via `repr()`.

🗑️ Removed (vs. earlier revisions of this PR)
- `custom_column_source` (L2) source-hashing path and its helpers. L2 had several silent footguns (closure-capture invisibility, L1≠L2 on configs with no custom columns, uncaught `ValueError` from `inspect.unwrap()` on cycles, same-`__name__` collisions silently restored when `getsource()` failed). Strengthening L1 with qualname/module/decorator-metadata covers the realistic cases without the surprise.
- `make test-config` passes (537 tests, 40 in `test_fingerprint.py`)
- `make test-engine` passes (1917 tests)
- `make check-config` clean (ruff format + lint)

✅ Checklist