Skip to content

feat: add workflow chaining#636

Merged
andreatgretel merged 16 commits into
mainfrom
andreatgretel/feat/workflow-chaining
May 18, 2026
Merged

feat: add workflow chaining#636
andreatgretel merged 16 commits into
mainfrom
andreatgretel/feat/workflow-chaining

Conversation

@andreatgretel
Copy link
Copy Markdown
Contributor

@andreatgretel andreatgretel commented May 12, 2026

📋 Summary

Adds workflow chaining so users can split a generation pipeline into explicit Data Designer stages, hand results between stages on disk, and keep shared runtime state such as throttling on one parent DataDesigner. The latest slice also makes processor-expressible stage transforms first class through seeded processor-only configs, stage output_processors, output="processor:<name>", and selected-output export.

🔗 Related Issue

Related to #552

🗺️ Plan Progress

Tracking against plans/workflow-chaining/workflow-chaining.md:

✅ Done in this PR

  • Phase 1 core: CompositeWorkflow, DataDesigner.compose_workflow(), linear disk handoff through LocalFileSeedSource, deterministic stage artifact directories, workflow metadata, on_success callbacks, empty-output handling, and stage result inspection.
  • Notebook and async sidecars: DatasetCreationResults.to_config_builder(), PreviewResults.to_config_builder(), DataDesigner.acreate(), shared throttle-manager reuse, and concurrent-safe seed/resource-provider handling.
  • Processor boundary extensions: seeded processor-only configs, output_processors=[...], output="processor:<name>", selected-output load_dataset()/count_records()/export(), and validation for invalid/no-output cases.

⏳ Still Missing

  • Phase 2 follow-through: complete allow_resize removal, docs updates, migration examples, and fail-fast row-count-change guards.
  • Phase 3: stage-level resume using fingerprints, ResumeMode, metadata validation, and callback output path checks.
  • Phase 4: DAG-shaped stages, parallel branches via acreate(), and an explicit join/merge contract.
  • Later extensions: config bundles for built-in/plugin stage patterns, broader first-class artifact/media seeding, selected-stage/full-workflow export/push semantics, and custom processors as the structured replacement for arbitrary callback transforms.

🔄 Changes

✨ Added

  • Add DataDesigner.compose_workflow() and CompositeWorkflow for linear stage chaining with deterministic artifacts/<workflow>/stage-N-name directories.
  • Add workflow metadata with stage status, config payloads, fingerprints, counts, seed paths, callback paths, output processors, selected output, and duration.
  • Add DatasetCreationResults.to_config_builder() and PreviewResults.to_config_builder() for lightweight notebook chaining.
  • Add DataDesigner.acreate() for awaitable creates, including shared-instance asyncio.gather(...) coverage.
  • Add seeded processor-only config support, so a workflow stage can be just processors seeded from upstream output.
  • Add output_processors=[ProcessorConfig, ...] to CompositeWorkflow.add_stage() for inline stage-boundary transforms.
  • Add output="processor:<name>" so downstream stages and final load_dataset()/count_records()/export() can read named processor artifacts.
  • Add load_stage_output(), count_stage_output_records(), and get_stage_output_path() so callers can inspect the actual selected output handed downstream.

🔧 Changed

  • Share the parent DataDesigner throttle manager across stage runs and gathered async creates.
  • Relax the config schema enough for seeded processor-only configs, then validate at the engine boundary that they are seeded and leave at least one output column.
  • Keep normal seeded generation profiling unchanged, while allowing processor-only runs to profile injected seed columns.
  • Validate output="processor:<name>" at add_stage() so typos fail before the workflow runs.
  • Keep set_run_config() throttle overrides effective by rebuilding the shared throttle manager when runtime config changes.

🐛 Fixed

  • Avoid mutating the parent DataDesigner artifact root during workflow runs.
  • Copy seed readers and resource-provider state per create call so concurrent creates do not share mutable runtime readers.
  • Reject push_to_hub() for selected processor/callback outputs instead of silently pushing the raw final-stage artifact.
  • Resolve selected processor output correctly when a stage also has output processors, and pass callbacks the main stage artifact directory.

📚 Docs

  • Update plans/workflow-chaining/workflow-chaining.md with implemented scope, remaining phases, and the output-processor/output-selection direction.
  • Add an experimental Workflow Chaining concept page to MkDocs and Fern latest navigation.

🧪 Tests

  • Add interface coverage for chained runs, callbacks, empty upstream handling, custom generators, async create, selected-output export/push behavior, processor-only stages, output processors, and processor output handoff.
  • Add validation coverage for seeded processor-only configs and all-output-dropped processor configs.

Demo

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

# Assume fast_model has alias="fast".
draft_qa = (
    dd.DataDesignerConfigBuilder(model_configs=[fast_model])
    .with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet"))
    .add_column(
        name="chunk_summary",
        column_type="llm_text",
        model_alias="fast",
        prompt="Summarize this source passage:\n\n{{ text }}",
    )
    .add_column(
        name="question",
        column_type="llm_text",
        model_alias="fast",
        prompt="Write a precise training question about this passage:\n\n{{ chunk_summary }}",
    )
    .add_column(
        name="answer",
        column_type="llm_text",
        model_alias="fast",
        prompt="Answer the question using only the passage.\n\nQuestion: {{ question }}\nPassage: {{ text }}",
    )
)

sft_export = (
    dd.DataDesignerConfigBuilder()
    .add_processor(
        dd.SchemaTransformProcessorConfig(
            name="chatml",
            template={
                "messages": [
                    {"role": "system", "content": "Answer using the provided source passage."},
                    {"role": "user", "content": "{{ question }}"},
                    {"role": "assistant", "content": "{{ answer }}"},
                ],
                "source": "{{ source_uri }}",
            },
        )
    )
)

workflow = data_designer.compose_workflow(name="doc-qa-sft")

workflow.add_stage(
    "draft_qa",
    draft_qa,
    num_records=500,
    output_processors=[
        dd.DropColumnsProcessorConfig(
            name="drop_generation_context",
            column_names=["text", "chunk_summary"],
        ),
    ],
)
workflow.add_stage("sft_export", sft_export, output="processor:chatml")

results = workflow.run()
training_rows = results.load_dataset()       # Reads processors-files/chatml/*.parquet.
results.export("chatml.jsonl")               # Exports the same selected workflow output.
drafts = results["draft_qa"].load_dataset()  # Raw text and scratch summary are already dropped.
stage_output = results.load_stage_output("sft_export")
stage_artifact = results["sft_export"].load_dataset()
chatml_artifact = results["sft_export"].load_processor_dataset("chatml")

This displays the new composition path:

  • output_processors remove bulky generation context before the handoff.
  • The second stage is processor-only and does not need a generator column.
  • output="processor:chatml" makes the schema-transform artifact the final workflow output for load_dataset(), count_records(), and export().
  • load_stage_output("sft_export") reads what the stage exposed downstream, while results["sft_export"] still exposes the stage's normal result and processor artifacts.

🔍 Attention Areas

Reviewers: Please pay special attention to the following:

  • composite_workflow.py - New public workflow API, stage metadata/fingerprints, output-processor execution, and selected-output semantics.
  • data_designer_config.py and validation.py - Config schema and boundary validation now allow seeded processor-only configs while rejecting invalid/no-output cases.
  • data_designer.py - acreate(), shared throttle-manager wiring, explicit artifact-root handling, and processor-only profiling fallback.
  • throttle_manager.py - Shared throttle behavior across sequential workflow stages and concurrent async creates.

🧪 Testing

  • make test passes
  • Unit tests added/updated
  • E2E tests added/updated - temp smoke pack added under /tmp/dd-workflow-smokes

Validation run:

  • make test - 3320 passed, 158 warnings
  • .venv/bin/ruff check --fix .
  • .venv/bin/ruff format .
  • .venv/bin/pytest packages/data-designer/tests/interface -q - 137 passed, 30 warnings
  • .venv/bin/pytest packages/data-designer/tests/interface/test_composite_workflow.py -q - 43 passed
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/resources/test_seed_reader.py -q - 43 passed
  • .venv/bin/pytest packages/data-designer/tests/interface packages/data-designer-engine/tests/engine/test_validation.py packages/data-designer-engine/tests/engine/resources/test_seed_reader.py -q - 190 passed, 30 warnings
  • .venv/bin/pytest /home/ubuntu/Code/reviews/DataDesigner-pr-636/smoke_test.py -q -v - 2 passed
  • .venv/bin/pytest /tmp/dd-workflow-smokes/test_workflow_chaining_smokes.py -q -s - 10 passed
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/test_validation.py packages/data-designer-config/tests/config/test_config_builder.py packages/data-designer-config/tests/config/test_fingerprint.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py -q - 206 passed, 39 warnings
  • make check-fern-docs - passed, Fern check 0 errors and 2 warnings
  • .venv/bin/mkdocs build - passed

Latest smoke rerun after the output_processors rename:

  • /home/ubuntu/Code/reviews/DataDesigner-pr-636/smoke_test.py - 2 passed
  • /tmp/dd-workflow-smokes/test_workflow_chaining_smokes.py - 10 passed, including the real NVIDIA provider smoke with 2 successful requests and no failures

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Description updated with AI

@andreatgretel andreatgretel force-pushed the andreatgretel/feat/workflow-chaining branch from d8327e0 to 4950ed5 Compare May 14, 2026 19:58
@andreatgretel andreatgretel marked this pull request as ready for review May 14, 2026 22:26
@andreatgretel andreatgretel requested a review from a team as a code owner May 14, 2026 22:26
@github-actions
Copy link
Copy Markdown
Contributor

Review: PR #636 — feat: add workflow chaining

Summary

This PR introduces a substantial new feature: composite workflows for chaining multi-stage Data Designer runs with disk handoffs, plus several supporting changes. Concretely:

  • New CompositeWorkflow / CompositeWorkflowResults classes (composite_workflow.py, ~520 LoC) created via DataDesigner.compose_workflow(name=...).
  • Supporting plumbing: DataDesigner.acreate(), shared ThrottleManager across create() calls, optional per-call artifact_path, to_config_builder() on DatasetCreationResults / PreviewResults.
  • Schema relaxation: DataDesignerConfig.columns no longer requires min_length=1. Engine-side validation now allows seeded processor-only configs and accounts for processor-dropped columns.
  • New ProcessorConfig postprocessors and output="processor:<name>" selection at stage boundaries.
  • SeedReaderRegistry.get_reader() now returns a shallow-copied reader (concurrency safety for attach()).
  • ~1,000 lines of new test coverage.

Changes are scoped within the documented dependency direction (interface → engine → config). No reverse imports introduced.

Findings

Code correctness

  • composite_workflow.py:111,118,134,138 — discarded property access for side effects.
    CompositeWorkflowResults.load_dataset, count_records, export, and push_to_hub call self.final_result purely to trigger the SkippedStageResult raise, then go on to use load_stage_output(...). This is functionally correct but unusual; a linter or future contributor may delete the bare self.final_result line as dead code. Consider extracting _assert_final_not_skipped() or naming the bound result and passing its final_dataset_path through where appropriate.

  • composite_workflow.py:578 — fall-through on zero upstream rows.
    num_records = stage.num_records or previous_output_records or DEFAULT_NUM_RECORDS. With current logic this is safe (a zero-record upstream sets skipped_upstream_stage and short-circuits the next iteration before this line), but it relies on a non-local invariant. A defensive previous_output_records or None (already truthy-checked) and a comment explaining "we never reach here when upstream is empty" would prevent regressions if the skip logic is later changed.

  • composite_workflow.py:530 — direct access to _artifact_path on parent DataDesigner.
    Same package, so private access is acceptable, but it would be cleaner to expose a single read-only property and use it for both the workflow root and the per-call artifact_path parameter.

  • composite_workflow.py:660shutil.rmtree(stage_path) without dry-run/confirmation.
    The path is <artifact_root>/<workflow_name>/stage-<i>-<name>. If a user typo'd a workflow name to collide with another folder under artifacts/, content would be wiped silently on every run(). The dir-name validation already rejects path separators, but a final guard like "path must be a subdirectory of workflow_path" check would harden this.

  • composite_workflow.py:794_load_parquet_dataset uses pd.concat([read_parquet(...)]) eagerly.
    Loads all parquet files into RAM. Acceptable as v1 since the existing ArtifactStorage.load_dataset() does the same thing, but worth noting for users with large workflow outputs. Consider documenting that load_stage_output() is a convenience for inspection rather than streaming.

  • seed_reader.py:677 — shallow copy() of the registered reader.
    This fixes the per-call attach() race. However, copy() is a shallow copy, so any nested mutable state on the reader instance (lists, dicts, cached parse contexts) is still shared. Looking at SeedReader subclasses, self._parse_context is reset via _PARSE_CONTEXT_UNSET so it does get re-derived per-attach, but this is a fragile invariant — a future field on the base reader that mutates in-place on the shared instance could re-introduce the race. Worth at least a code comment on get_reader documenting why a copy (not deepcopy) is sufficient, with the implicit contract that reader subclasses must not store mutable shared state.

  • composite_workflow.py:565-573 — silent override of explicit seed config.
    When a stage has its own seed config and the workflow seeds from upstream, the upstream wins and only a logger.warning is emitted. This may be the right call (workflows-author-knows-best), but a strict mode (e.g., _WorkflowStage(allow_seed_override=False)) for users who'd rather see the typo as an error would be a small follow-up.

Tests

  • Coverage looks solid: chained runs, callbacks (including row-expansion), empty-output handling, postprocessors, processor-output selection, processor-only stages, custom generators, async parallelism, throttle-manager reuse, and (importantly) negative cases for invalid names/outputs.
  • Gap: I don't see direct coverage of sampling_strategy / selection_strategy parameters on add_stage(). They're plumbed through to with_seed_dataset() but unverified — at minimum a MagicMock assertion on the with_seed_dataset kwargs would prevent regressions.
  • Gap: No explicit test that an empty stage with allow_empty=False (the default) raises DataDesignerWorkflowError. The "empty stages can skip downstream" test exercises the allow_empty=True branch only.
  • test_acreate_does_not_serialize_create_calls is well-constructed (uses threading.Event + assert both_started.wait() to verify true parallelism, not just absence of serialization).

Project conventions

  • from __future__ import annotations and absolute imports — consistent throughout new files. ✅
  • Lazy heavy imports (lazy.pq, lazy.pd) are used in workflow utility helpers. ✅
  • The new composite_workflow is exposed through the lazy-imports map in interface/__init__.py, matching existing pattern. ✅
  • Type annotations: I noticed one slightly weak annotation — column_configs: list | None = None in DataDesigner._create_dataset_profiler (data_designer.py:639). The element type is omitted; should be list[ColumnConfigT] | None (or whatever the existing column-configs alias is) per the styleguide's "modern syntax: list[str]" rule.
  • Naming: count_stage_output_records reads slightly awkward next to load_stage_output and get_stage_output_path (verb is at the front in the latter two but middle in the former). Minor.

Performance

  • Per-stage workflow metadata is rewritten from scratch on each status update (status: runningrunning again with config payload → completed). For a 20-stage workflow this is fine; for a hypothetical large stage list it's O(N²) writes. Probably not worth changing.
  • The shared ThrottleManager is genuinely reused across stages and async creates, which is the documented invariant — verified by test_data_designer_reuses_throttle_manager_across_create_calls.

Security / risk

  • No new external network calls or secrets handling. ✅
  • on_success callback returns a path that is then read with _count_parquet_records / _local_seed_source_from_path. The path is user-supplied code's return value; if it points outside the workflow tree, the workflow happily seeds from it. That's by design (the demo "expand turns" callback writes wherever it wants), but worth noting in docs that callback authors are trusted.
  • _validate_dir_name blocks /, \, .. is not on its blocklist. Stage name .. would currently pass (.. has no path-separator chars but resolves outwards). Recommend adding .. and leading dots to the blocklist, or using Path(name).resolve().is_relative_to(workflow_path) after building the path.

Backward compatibility

  • DataDesignerConfig.columns lost its min_length=1 constraint. Existing configs continue to work; previously-rejected empty-column configs now reach the engine-side validator, which still rejects them unless they're seeded + have processors. No behavioral regression for valid inputs.
  • New optional artifact_path parameter on DataDesigner.create() is keyword-only and defaults to None → unchanged behavior.

Verdict

Approve with minor follow-ups. The implementation is well-structured, tests are extensive and exercise both the mocked-fast and real-engine paths, and the architecture respects the project's layering invariants. The biggest opportunities for follow-up are:

  1. Replace the discarded-property-access idiom in CompositeWorkflowResults with a named guard helper.
  2. Add .. / leading-dot rejection to _validate_dir_name, plus a hard relative-path check before shutil.rmtree.
  3. Add tests for sampling_strategy / selection_strategy and for the default allow_empty=False rejection path.
  4. Tighten the column_configs: list | None annotation in _create_dataset_profiler.
  5. Document the "no shared mutable state" contract on SeedReader subclasses next to get_reader's shallow copy.

None of these block the merge — they'd be reasonable cleanup commits or a follow-up PR.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR introduces experimental workflow chaining for Data Designer, allowing multi-stage generation pipelines where each stage hands off its output on disk to the next. The implementation spans a new CompositeWorkflow public API, DataDesigner.acreate(), shared throttle manager reuse across stages, stage-level output_processors, output=\"processor:<name>\" selection, and relaxed config schema for seeded processor-only stages.

  • New CompositeWorkflow in composite_workflow.py orchestrates linear stage chaining with deterministic artifact directories, per-stage fingerprints, workflow metadata, and on_success callbacks; DataDesigner.compose_workflow() is the entry point.
  • Seeded processor-only configs are now valid: DataDesignerConfig.columns drops the min_length=1 constraint, and boundary validation in validation.py now allows seed columns to satisfy the "at least one remaining column" check when processor configs are present.
  • Shared infrastructure — the parent DataDesigner throttle manager propagates into every stage create call, SeedReaderRegistry.get_reader() shallow-copies readers before attach() to prevent concurrent-create aliasing, and acreate() wraps synchronous create() via asyncio.to_thread.

Confidence Score: 4/5

Safe to merge with one fix: duplicate names within output_processors silently corrupt stage output.

The workflow chaining logic, throttle-manager sharing, and seeded-processor-only validation path are all well-reasoned and covered by tests. The one concrete defect is that _validate_distinct_output_processors does not detect duplicates within the output_processors list itself — two processors with the same name would overwrite each other's directory and deliver wrong data to any caller using load_stage_output() or load_processor_dataset() for that name.

composite_workflow.py — specifically _validate_distinct_output_processors and its companion test coverage

Important Files Changed

Filename Overview
packages/data-designer/src/data_designer/interface/composite_workflow.py New file implementing the core workflow chaining API; logic for stage execution, output_processors, output selection, and fingerprinting is sound, but _validate_distinct_output_processors does not guard against duplicate names within the output_processors list itself.
packages/data-designer/src/data_designer/interface/data_designer.py Adds acreate() (asyncio.to_thread wrapper), compose_workflow(), shared throttle_manager propagation into create_resource_provider, and set_run_config throttle rebuild — all straightforward and correct.
packages/data-designer-engine/src/data_designer/engine/validation.py validate_columns_not_all_dropped now counts non-dropped seed columns toward remaining_cols when processor_configs are present, enabling seeded processor-only configs; the logic correctly handles the empty-processor-list case via falsy check.
packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py Adds shallow copy of the seed reader before attach() to prevent concurrent creates from sharing mutable top-level state; constraint that subclasses must not hold nested mutable state is documented in the comment.
packages/data-designer-config/src/data_designer/config/data_designer_config.py Removes min_length=1 from columns field, allowing empty-column configs for seeded processor-only stages; boundary validation in the engine enforces the invariant instead.
packages/data-designer/tests/interface/test_composite_workflow.py Comprehensive test coverage for stage chaining, callbacks, empty upstream, output_processors, processor output selection, async create, and validation rejections; no test for duplicates within the output_processors list.

Sequence Diagram

sequenceDiagram
    participant User
    participant CW as CompositeWorkflow
    participant DD as DataDesigner

    User->>CW: workflow.add_stage(name, builder, output_processors, output)
    Note over CW: Validate output, distinct names, dir names
    User->>CW: workflow.run()
    loop For each stage
        CW->>CW: Clone config_builder, inject upstream seed path
        CW->>DD: "create(stage_builder, artifact_path=workflow_path)"
        DD-->>CW: DatasetCreationResults (result)
        CW->>CW: "actual_records = result.count_records()"
        alt output_processors present
            CW->>DD: "create(output_processor_builder, num_records=actual_records)"
            DD-->>CW: DatasetCreationResults (output_result)
            CW->>CW: _select_output_result() to get output_source_result
        end
        alt on_success callback set
            CW->>CW: "output_seed_path = on_success(result.base_dataset_path)"
        else
            CW->>CW: "output_seed_path = _resolve_stage_output_path(output_source_result, output)"
        end
        CW->>CW: "stage_results[name] = output_result"
        CW->>CW: "stage_output_paths[name] = output_seed_path"
        CW->>CW: _write_workflow_metadata(workflow_path, metadata)
    end
    CW-->>User: CompositeWorkflowResults
    User->>CW: results.load_stage_output(name) reads from stage_output_paths
    User->>CW: results[name].load_dataset() reads from output_result.artifact_storage
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/data-designer/src/data_designer/interface/composite_workflow.py:557-565
`_validate_distinct_output_processors` checks for name collisions between the stage's inline processors and `output_processors`, but it does not check for duplicate names *within* `output_processors` itself. If two entries share the same name (e.g., two `DropColumnsProcessorConfig(name="drop")` instances), both would write to the same `processors_outputs_path / "drop"` directory, with the second run silently overwriting the first. Any call to `load_stage_output()` or `load_processor_dataset()` referencing that name would then return the output of whichever processor ran last, not the one the author intended.

```suggestion
def _validate_distinct_output_processors(
    config_builder: DataDesignerConfigBuilder,
    output_processors: list[ProcessorConfig],
) -> None:
    stage_processor_names = {processor.name for processor in config_builder.get_processor_configs()}
    duplicate_names = stage_processor_names.intersection(processor.name for processor in output_processors)
    if duplicate_names:
        names = ", ".join(sorted(duplicate_names))
        raise DataDesignerWorkflowError(f"Output processor names must be distinct from stage processor names: {names}.")
    output_processor_names = [processor.name for processor in output_processors]
    seen: set[str] = set()
    duplicates_within = {name for name in output_processor_names if name in seen or seen.add(name)}  # type: ignore[func-returns-value]
    if duplicates_within:
        names = ", ".join(sorted(duplicates_within))
        raise DataDesignerWorkflowError(f"Output processor names must be unique within output_processors: {names}.")
```

Reviews (7): Last reviewed commit: "fix: address workflow review nits" | Re-trigger Greptile

Comment thread packages/data-designer/src/data_designer/interface/composite_workflow.py Outdated
Signed-off-by: Andre Manoel <amanoel@nvidia.com>
Signed-off-by: Andre Manoel <amanoel@nvidia.com>
@nabinchha
Copy link
Copy Markdown
Contributor

Thanks for putting this together, @andreatgretel — a lot of moving parts here (compose API, async, throttle reuse, processor-only configs, output selection) and the design coheres pretty well.

Summary

Adds CompositeWorkflow for chaining DataDesigner.create() runs with on-disk handoff, deterministic stage directories, shared throttle reuse, and structured stage-boundary transforms via postprocessors=[...] and output="processor:<name>". Also adds an async acreate() sidecar, to_config_builder() helpers, and validation/config relaxation to support seeded processor-only configs. The implementation matches the PR description and the linked plan; the two prior P1/P2 findings (validation regression, export glob divergence) are addressed in 9d9d1d1 and 77e7297 respectively.

Findings

Warnings — Worth addressing

packages/data-designer/src/data_designer/interface/composite_workflow.py:299-301_count_parquet_records leaks FileNotFoundError for bad callback paths

  • What: _count_parquet_records(callback_output_path) is invoked on whatever path the user-provided on_success callback returns. If the callback returns a non-existent file (or even just a misspelled directory that doesn't exist), _parquet_files() falls into the not path.is_dir() branch, returns [path], and lazy.pq.read_metadata(path) then raises a raw FileNotFoundError / pa.ArrowInvalid from PyArrow.
  • Why: Per AGENTS.md ("Errors normalize at boundaries") and STYLEGUIDE.md ("Wrap third-party exceptions at module boundaries"), user-facing callers should see DataDesignerWorkflowError rather than a leaked PyArrow/OS error. The user has no easy way to tell whether the error came from their callback returning the wrong thing or from a more fundamental engine issue.
  • Suggestion: Validate the callback's return value (exists() / readable) before counting, and normalize the failure:
    if stage.on_success is not None:
        callback_output_path = Path(stage.on_success(result.artifact_storage.base_dataset_path))
        if not callback_output_path.exists():
            raise DataDesignerWorkflowError(
                f"Stage {stage.name!r} on_success returned non-existent path: {callback_output_path}."
            )
        output_seed_path = callback_output_path
        output_records = _count_parquet_records(callback_output_path)
    Alternatively, wrap the inner read in _count_parquet_records so the same normalization helps the "empty processor output directory" edge case too.

packages/data-designer/src/data_designer/interface/composite_workflow.py:91-145, 329 — Up to three on-disk views of a stage, with inconsistent rules for which one each surface returns

  • What: A workflow stage can produce up to three distinct on-disk views of its data, and each user-facing surface uses different rules to pick between them:

    1. Main stage output<stage>/parquet-files/ (always produced).
    2. Postprocessor output<stage>/postprocessors/parquet-files/ (when postprocessors=[...] is set).
    3. Callback output — whatever on_success(...) returns (when set).

    The exposed routing rules:

    • results[name].load_dataset() reads view (2) when postprocessors is set, otherwise view (1). It is never the callback view.
    • results.load_stage_output(name) reads view (3) when on_success is set; otherwise the selected processor artifact for output="processor:<x>"; otherwise view (1)/(2) per the rule above.
    • Downstream stages always seed from whatever load_stage_output would return.
    • For a stage with output="processor:compact" and no postprocessors, results["compact"].load_dataset() is still the main parquet-files/ view (with all columns), not the selected processor artifact. Different from the postprocessor case.
    • With both postprocessors=[...] and on_success=cb set, results[name].load_dataset() and results.load_stage_output(name) can return different DataFrames for the same stage.

    Both test_composite_workflow_postprocessors_transform_stage_output and test_composite_workflow_output_can_select_processor_artifact document the per-feature behavior, but neither tests the combined case, and the class docstring on CompositeWorkflowResults only calls out the on_success rule.

  • Why: Users debugging a workflow will reasonably assume results[name].load_dataset() and results.load_stage_output(name) are interchangeable views of "the stage's output". With this PR they are only sometimes interchangeable, and the rule depends on which of three knobs the stage was configured with. Two readers comparing notes about the same stage can answer "how many rows did draft_qa produce?" with different numbers without either being wrong.

  • Suggestion: Pick one of the two:

    • (a) Make it consistent. results[name] always exposes the main create() result. load_stage_output(name) is the one surface that follows postprocessors/output=/on_success. This collapses three rules into one and matches the mental model users already have for DatasetCreationResults.
    • (b) Keep current behavior but document it. Spell out the full routing table in the CompositeWorkflowResults class docstring (or the add_stage() docstring), including the combined postprocessors + on_success case. At minimum, add a test that exercises both knobs together so the contract is locked in.

    Either way I'd recommend explicitly discouraging the postprocessors + on_success combination in docs unless there's a concrete use case the team wants to support — at present it's tested as a single corner case (test_composite_workflow_callback_receives_main_stage_artifact_with_postprocessors) but the resulting three-view semantics are hard to reason about in user code.

packages/data-designer/src/data_designer/interface/composite_workflow.py:294-301_count_parquet_records runs unconditionally before on_success, blocking the callback from rescuing the stage

  • What:
    output_seed_path = _resolve_stage_output_path(output_source_result, stage.output)
    callback_output_path = None
    output_records = _count_parquet_records(output_seed_path)   # always runs
    
    if stage.on_success is not None:
        callback_output_path = Path(stage.on_success(result.artifact_storage.base_dataset_path))
        output_seed_path = callback_output_path
        output_records = _count_parquet_records(callback_output_path)
    We always count records on the resolved selected output, then throw the value away and recount on the callback output. The first count is unused work in the success case and a real failure mode in the edge case.
  • Why: When on_success is set, the callback is the stage's escape hatch for transforming or replacing the stage output before downstream stages see it (e.g., the docs explicitly call out using on_success to read a processor artifact and write a LocalFileSeedSource-compatible dataset). The current order means a stage whose primary selected output happens to be empty or non-existent — exactly the situation on_success exists to handle — raises from the first _count_parquet_records and never gets to call the callback. That defeats the design intent of on_success as a recovery / reshaping hook. It also re-leaks the same unwrapped PyArrow / FileNotFoundError flagged in the warning above.
  • Suggestion: Only count what actually feeds downstream. Run the callback first when it's configured, then resolve output_seed_path/output_records once:
    callback_output_path: Path | None = None
    if stage.on_success is not None:
        callback_output_path = Path(stage.on_success(result.artifact_storage.base_dataset_path))
        output_seed_path = callback_output_path
    else:
        output_seed_path = _resolve_stage_output_path(output_source_result, stage.output)
    output_records = _count_parquet_records(output_seed_path)
    Combined with the _count_parquet_records normalization fix above, this also gives on_success callbacks a clean way to recover from an otherwise-empty selected output.

packages/data-designer/src/data_designer/interface/composite_workflow.py:154-190add_processor(...) and postprocessors=[...] have indistinguishable call sites but different downstream visibility (privacy/correctness footgun)

  • What: A user has two ways to attach a ProcessorConfig to a workflow stage, and the names don't telegraph which one propagates to the next stage:

    config_builder.add_processor(DropColumnsProcessorConfig(name="redact", column_names=["secret"]))
    workflow.add_stage("draft_qa", config_builder)   # next stage still sees `secret`
    
    workflow.add_stage(
        "draft_qa",
        config_builder,
        postprocessors=[DropColumnsProcessorConfig(name="redact", column_names=["secret"])],
    )                                                # next stage does NOT see `secret`

    Both call sites take the same ProcessorConfig type. The first writes the processor's output to <stage>/processors-files/redact/ as a side artifact — downstream workflow stages still seed from the main parquet-files/, which contains the un-dropped column. The second runs the processor as a separate seeded create() whose output replaces what flows downstream.

  • Why: For DropColumns specifically, this is a real correctness footgun, and the most likely instinct for a user is wrong. Someone writing add_processor(DropColumnsProcessorConfig(column_names=["secret"])) on a stage in a workflow has a reasonable mental model: "I'm dropping secret, downstream shouldn't see it." That model is silently violated — the secret column is propagated to every downstream stage's seed dataset, gets included in their fingerprints, and ends up in their artifacts. The same shape applies to any "filter for safety" / "strip PII" pattern. The only signal that the user is using the wrong knob is reading the source of composite_workflow.run() and noticing that downstream seeding goes through output_seed_path, not processors_outputs_path.

  • Suggestion: Two complementary fixes:

    • Document the difference in the add_stage() docstring and the add_processor() docstring, with a short table of "where does the output go?" and "does downstream see it?". This is cheap and stops the foot-gun in its tracks.
    • Consider warning at runtime when a workflow stage's config_builder has a DropColumnsProcessorConfig (or other "destructive" processors) and the same names aren't also in postprocessors. Something like: logger.warning("Stage %r has inline DropColumns processor %r; columns dropped here remain visible to downstream stages. Move to add_stage(postprocessors=[...]) if you intended downstream cleanup."). Even gated behind a feature-detect-only check, this would catch the privacy-leak case without false-positiving on legitimate side-artifact usage.

    Renaming the kwarg to something directional (see the Suggestion below) would also help, but won't retroactively rescue users who reach for add_processor first.

packages/data-designer/src/data_designer/interface/data_designer.py:500-503 — Inline import for CompositeWorkflow violates STYLEGUIDE.md import rule

  • What: compose_workflow() imports CompositeWorkflow inside the function body rather than at module scope.
  • Why: composite_workflow.py only references DataDesigner inside its TYPE_CHECKING block, so a top-level from data_designer.interface.composite_workflow import CompositeWorkflow in data_designer.py will not create a circular import. STYLEGUIDE.md is explicit on this: "Place imports at module level, not inside functions (exception: unavoidable for performance reasons)." CompositeWorkflow itself doesn't pull in any heavy third-party imports beyond what data_designer.py already loads (it goes through data_designer.lazy_heavy_imports), so there's no perf justification either. Leaving it inline also hides the dependency from ruff isort, so import ordering won't catch future reshuffles.
  • Suggestion: Hoist the import to the top of data_designer.py alongside the other data_designer.interface.* imports:
    from data_designer.interface.composite_workflow import CompositeWorkflow
    Then drop the inline import inside compose_workflow(). If a future change does introduce a circular import (e.g., composite_workflow.py starts using DataDesigner at runtime), add an explicit comment at the inline-import site explaining the constraint so the next reader knows why the rule is bent.

Suggestions — Take it or leave it

packages/data-designer/src/data_designer/interface/composite_workflow.py:63-66SkippedStageResult.status could be a Literal/Enum

  • What: status: str only ever takes the literal value "skipped_empty_upstream" (set on line 225 inside run()).
  • Why: Bare str invites typos in user code (result.status == "skipped_empty_upstrem" would silently never match) and is less self-documenting than a Literal["skipped_empty_upstream"] or an Enum. Matches the pattern already used in ResumeMode, SamplingStrategy, etc.
  • Suggestion: Either status: Literal["skipped_empty_upstream"] = "skipped_empty_upstream" (so users get a static-typing hint) or promote to a StageStatus enum if you anticipate "completed_empty", "failed", etc. landing here in Phase 3.

packages/data-designer/src/data_designer/interface/composite_workflow.py:110-119self.final_result accessed for its side effect only

  • What: load_dataset() and count_records() call self.final_result and discard the value purely so the property raises when the final stage was skipped.
  • Why: This works but reads like dead code — a reviewer pasting self.final_result into the body looks like a forgotten return value. A named guard makes intent clear.
  • Suggestion:
    def load_dataset(self) -> pd.DataFrame:
        self._require_final_result()
        return self.load_stage_output(self.final_stage_name)
    where _require_final_result() returns None and raises if skipped. Same shape for count_records().

packages/data-designer/src/data_designer/interface/results.py:89-97 and packages/data-designer-config/src/data_designer/config/preview_results.py:46-56to_config_builder() docstring should warn about memory cost

  • What: Both to_config_builder() helpers fully materialize the dataset (self.load_dataset()/self.dataset) and df.copy(). The plan file documents this as a notebook convenience, but the public method docstrings don't say "loads the full dataset into memory" — easy to grab in production code and OOM on a large run.
  • Suggestion: Either add a one-line "Loads the full dataset into memory; intended for interactive use. For production pipelines use CompositeWorkflow." note to both docstrings, or call out the limitation in Notes: so it surfaces in generated API docs.

packages/data-designer/src/data_designer/interface/composite_workflow.py:434-444 — Document tie-breaking when a main-stage processor name and a postprocessor name collide

  • What: _select_output_result prefers the postprocessor result when the requested output="processor:<name>" matches the name of any postprocessor, even if a main-stage processor with the same name also exists. This is intentional (the postprocessor is the more recent transform) but easy to miss when validation in _validate_stage_output_processor accepts the name from either set without flagging the duplicate.
  • Suggestion: Either flag duplicates at add_stage() time (raise if the same processor name appears in both config_builder.get_processor_configs() and postprocessors), or add a short note in _validate_stage_output_processor / the add_stage() docstring stating that postprocessor names take precedence.

packages/data-designer/src/data_designer/interface/composite_workflow.py:162 — Rename postprocessors for directional intent

  • What: The kwarg name postprocessors only tells the reader when the processor runs ("after the stage"), not what it does to the pipeline ("changes what downstream sees"). Combined with the footgun above, this makes the wrong choice between add_processor(...) and postprocessors=[...] more likely than it should be.
  • Why: add_processor is already established for "processors that run as part of this stage's normal pipeline." The new attachment point's value-add is that it routes the processor's output to downstream stages — that's the actual semantic difference, but the name doesn't say it. A directional name communicates intent at the call site, before the user has to consult the docs:
    Name When it runs? What downstream sees? Closed compound?
    postprocessors yes unclear yes
    post_processors yes unclear no (breaks the codebase's "Processor as one word" convention — see add_processor, ProcessorConfig, ProcessorRunner)
    output_processors implied yes yes
    downstream_processors yes yes yes
    between_stage_processors yes yes yes
  • Suggestion: If the team is going to spend the rename budget once (cheap pre-release, expensive after), spend it on directional intent. My favorite is output_processors — short, parallels output="processor:...", and naturally hints that what it produces is the stage's output. Whatever the choice, do not split "processor" with an underscore (post_processors) — that breaks the existing closed-compound convention and adds a new vocabulary item without solving the directional-intent problem.

packages/data-designer/src/data_designer/interface/composite_workflow.py:280-291 — Two parquet-files/ directories at different depths is visually ambiguous

  • What: When postprocessors=[...] is set, the stage artifact tree looks like:
    artifacts/<workflow>/stage-0-draft_qa/
      parquet-files/                  # main stage output (still has all columns)
      builder_config.json
      metadata.json
      postprocessors/                 # the synthesized sub-stage
        parquet-files/                # post-processed output (what flows downstream)
        builder_config.json           # auto-synthesized seeded-processor-only config
        metadata.json
    
    Two parquet-files/ directories, two builder_config.json files, and the inner builder_config.json is a config the user never wrote (it's the seeded-processor-only config the workflow synthesizes via _postprocessor_config_builder). A user opening the directory to debug "what did this stage actually produce?" has to know which one to look at and that the inner config is auto-generated.
  • Suggestion: A short README.md or _postprocessor_overview.txt file at the postprocessor sub-stage root explaining "this directory was generated by the workflow as the postprocessor pass; the parent's parquet-files/ is the unmodified main stage output" would prevent the confusion entirely. Could also add a key in workflow-metadata.json (e.g. "postprocessor_overview") pointing readers at the right directory for "what fed the next stage." This is purely a docs / artifact-layout polish, but it's nearly free and saves the next debugger 5–10 minutes.

What Looks Good

  • Test coverage is impressive — linear chaining, explode/filter via callbacks, async + concurrent acreate with the both_started event harness, processor-only stages, postprocessors, processor-output selection with and without postprocessors, push-to-hub guarding selected outputs, and a parametrized validation matrix covering invalid workflow/stage/output names. The split between fast mock-backed tests and _real_data_designer end-to-end tests with stub providers is the right shape.
  • Throttle handling lands cleanly: DataDesigner._throttle_manager is rebuilt in set_run_config, passed through _create_resource_provider → create_resource_provider → create_model_registry only when not already provided, and verified by test_data_designer_reuses_throttle_manager_across_create_calls and test_run_config_setting_persists. Good plumbing.
  • The seeded-processor-only validation fix in 9d9d1d1 nails the right scope — gating the processor-drop filter on c.column_type == SEED_DATASET preserves the new processor-only path without re-introducing the false-positive on SchemaTransform configs. The dedicated tests (..._allows_generated_columns_dropped_by_processors, ..._still_rejects_seed_only_config) document the contract well.
  • SeedReaderRegistry.get_reader now copy(...)s before attach, which is a low-risk fix for the concurrent-creates race and is exercised both by the unit test update (reader is not df_reader) and by test_acreate_supports_gathered_real_async_workflows.

Verdict

Needs changes — no Critical issues, and the two prior review points are resolved, but the Warnings are worth addressing before merge:

  • Normalize FileNotFoundError / Arrow errors from _count_parquet_records on user-callback paths into DataDesignerWorkflowError (or pre-validate callback_output_path.exists()).
  • Run on_success before resolving/counting output_seed_path so callbacks can rescue empty or missing selected outputs (and avoid the duplicate _count_parquet_records call).
  • Unify (or fully document) the routing rules so results[name] and load_stage_output(name) have a single mental model across postprocessors, output="processor:<name>", and on_success — including the combined cases.
  • Close the add_processor vs postprocessors=[...] privacy/correctness footgun via docs and/or a runtime warning for destructive inline processors in a workflow stage.
  • Hoist the inline CompositeWorkflow import in data_designer.py:500-503 to module scope per STYLEGUIDE.md.

Suggestions are take-it-or-leave-it, though the postprocessorsoutput_processors (or similar) rename is much cheaper to do now than after release.


This review was generated by an AI assistant.

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
@andreatgretel
Copy link
Copy Markdown
Contributor Author

Thanks Nabin, this was super helpful. I pushed 881aba09 addressing the warning set:

  • Renamed postprocessors to output_processors so the stage-boundary behavior is clearer at the call site.
  • Normalized missing/unreadable callback and selected-output parquet paths through DataDesignerWorkflowError.
  • Moved on_success before output counting, so callbacks can replace the selected output and we only count the path that actually feeds downstream.
  • Kept the two-surface model but documented it: results[name] is the effective create() result for that stage, while load_stage_output(name) is the selected downstream output, including output=... and on_success.
  • Added duplicate-name validation between main stage processors and output_processors.
  • Hoisted the CompositeWorkflow import and replaced the skipped status string with SkippedStageStatus.
  • Added/updated coverage for missing callback output, empty callback default failure, callback replacing selected output before counting, output processor routing, duplicate processor names, and combined callback/output-processor cases.

I also updated the PR description and reran the targeted suites plus smoke tests, including the real NVIDIA provider smoke. Latest smoke rerun: review smoke 2 passed, temp workflow smoke pack 10 passed.

@nabinchha
Copy link
Copy Markdown
Contributor

Followup on 881aba09 ("fix: clarify workflow output processors") — thanks for turning this around so fast, @andreatgretel. All five Warnings from my prior review are resolved and 5 of 6 Suggestions are addressed; the remaining Suggestion (artifact-tree marker file) is even more take-it-or-leave-it now that the directory is named output-processors/. One new Warning emerged on a pass through the test surface — details below.

New observations on 881aba09

Warnings — worth addressing before merge

packages/data-designer/src/data_designer/interface/composite_workflow.py:217 and tests reach into private DataDesigner state

  • What: Both production code and the new tests cross the underscore-private boundary on DataDesigner:

    • Production: composite_workflow.py:217 does workflow_path = self._data_designer._artifact_path / self.name. CompositeWorkflow has no other way to know where to materialize stage directories because compose_workflow(self, *, name: str) doesn't accept a path and _artifact_path isn't exposed publicly.
    • Tests: four references to data_designer._artifact_path in test_composite_workflow.py (lines 73, 180, 883, 924 — three inside fake_create mocks that mirror the production fallback, plus one direct assertion that the constructor stored what was passed in), and two references to data_designer._create_resource_provider(...) in test_acreate.py:131-132 (the throttle-manager-reuse assertion has no public seam to land on).
  • Why: STYLEGUIDE.md and AGENTS.md are explicit that the public/private split should hold — public APIs are what callers depend on, and private state is implementation. The new feature inverts that: CompositeWorkflow is a public class but its core "where do I write?" rule is implemented by reading data_designer._artifact_path, and the tests then have to follow into the same private hatch to pin the behavior down. Once a private symbol is referenced from another module and tests, it's effectively a public contract that just happens to start with an underscore. That makes future refactors (rename, lazy-resolve, scope to a per-call temp dir) silently risky and mismatches the rest of the public surface.

  • Suggestion: Pick one, in roughly increasing scope:

    1. Promote artifact_path to a public read-only property on DataDesigner. The constructor parameter is already public (DataDesigner(artifact_path=...)), so this is a one-line @property that just returns self._artifact_path. Tests and composite_workflow.py:217 then drop the underscore, and the rule becomes part of the documented surface instead of a leak.
    2. Thread the path through compose_workflow / CompositeWorkflow.__init__. compose_workflow(name=..., artifact_path=...) defaulting to the parent's path, with CompositeWorkflow storing it as self._artifact_path. Removes the cross-object reach entirely and lets users override the workflow root without mutating the parent.
    3. Expose a throttle_manager (or model_registry) accessor on DataDesigner so test_data_designer_reuses_throttle_manager_across_create_calls can assert on the public surface instead of _create_resource_provider. The reuse-across-stages contract is exactly the kind of public guarantee that deserves a public seam.

    (1) + (3) would cover all six accesses with minimal disruption. (2) is the structurally cleanest but bigger; either ordering is fine.

Suggestions — take it or leave it

packages/data-designer/src/data_designer/interface/composite_workflow.py:466-483 — bare except Exception in the parquet read wrappers

  • Both _count_parquet_records and _load_parquet_dataset use except Exception as e: raise DataDesignerWorkflowError(...) from e. That's appropriate at a normalization boundary, but Exception is broader than needed and will swallow things like KeyError/AttributeError introduced by future refactors as a "failed to read parquet file" message. Tightening to except (OSError, lazy.pa.ArrowException, ValueError) (or whatever the intended PyArrow base class is) would surface unintended bugs faster.

packages/data-designer/src/data_designer/interface/composite_workflow.py:75-81CompositeWorkflowResults docstring is technically correct but easy to misread

  • "Per-stage entries are the effective DataDesigner.create() results. For stages with output_processors, this is the output-processor create result." A reader scanning for "what's in results['my_stage']?" is likely to assume the same data that flows downstream. The actual contract — results[name].load_dataset() returns the output-processor create() result regardless of whether output="processor:<x>" or on_success redirected the downstream feed — only becomes obvious after reading run(). A two-bullet table in the class docstring (results[name] → … / load_stage_output(name) → …) would close the gap entirely without any code change.

packages/data-designer/src/data_designer/interface/composite_workflow.py:299-311 — output-processor create(num_records=actual_records, ...) when the main stage produced 0 rows

  • If a callback-less stage's main create() returns actual_records == 0, we still call self._data_designer.create(output_processor_builder, num_records=0, ...), which depending on engine validation will either raise inside create() (caught and re-raised as a "failed" stage with no obvious hint that empty input was the cause) or no-op. Not new in this commit, but with the cleaner ordering it's more visible. A two-line guard — if actual_records == 0: skip output_processors and let the empty-stage branch handle it — would make the empty path identical with and without output_processors.

What looks good in 881aba09

  • The reordering around on_success is exactly the shape suggested: callback resolves the path, single count afterward, no double-work in the success case and no leaked Arrow errors in the failure case.
  • The SkippedStageStatus enum is wired all the way through (__init__.py, _LAZY_IMPORTS, the existing skip test). Good full-stack hygiene on what was a pure-cleanup suggestion.
  • _validate_distinct_output_processors runs before _validate_stage_output_processor, so a user passing output="processor:foo" with foo duplicated across both surfaces gets the more useful error first. Subtle but right.
  • New tests are well-targeted: missing_callback_output_raises_workflow_error, callback_replaces_selected_output_before_counting, and rejects_duplicate_output_processor_names each pin one of the prior Warnings down to a regression-resistant assertion.
  • The plan doc (plans/workflow-chaining/workflow-chaining.md) was updated end-to-end to match the new naming, so future contributors won't see stale postprocessors vocabulary.

Verdict

Needs changes — every prior Warning is resolved (huge progress), and the new finding is a single, well-scoped one: close the underscore-private reach between CompositeWorkflow and DataDesigner (and the matching test references). Promoting artifact_path to a public property is enough to drop five of the six accesses; adding a throttle_manager accessor (or reframing the throttle-reuse test as a behavioral check on acreate()) closes the last one. Suggestions are take-it-or-leave-it.

@nabinchha
Copy link
Copy Markdown
Contributor

Thanks for putting this together, @andreatgretel — this is a substantial, well-tested addition and I really enjoyed the read.

Summary

Adds CompositeWorkflow + DataDesigner.compose_workflow() to chain create() calls across explicit stages with on-disk handoff, plus async acreate(), shared throttle-manager reuse, output_processors, and output="processor:<name>" selected outputs. The implementation matches the PR description: scope is deliberately Phase 1 + processor-boundary extensions, with resume / DAGs / parallel branches called out as follow-up work.

Findings

Warnings — Worth addressing

packages/data-designer/src/data_designer/interface/composite_workflow.py:74-210 — Public API is missing docstrings

  • What: CompositeWorkflow, CompositeWorkflow.__init__, CompositeWorkflow.run, plus many CompositeWorkflowResults methods (load_dataset, count_records, get_stage_output_path, load_stage_output, count_stage_output_records, export, push_to_hub, final_result) have no docstrings. DataDesigner.compose_workflow() in data_designer.py:500 is also bare.
  • Why: STYLEGUIDE requires docstrings on public API classes/functions, and CompositeWorkflow.run() in particular has non-obvious behavior worth explaining at the call site (deterministic stage dirs, fingerprinting, on_success contract, empty-output handling, selected-output semantics). Notebook users will mostly discover this via tab-completion and help(...).
  • Suggestion: Add Google-style docstrings to the class, __init__, run, and the result-loader/exporter methods. A one-liner is enough for the simple ones; run() and compose_workflow() deserve a few Args/Returns/Raises sections since they're the headline entry points.

packages/data-designer/src/data_designer/interface/data_designer.py:500-501compose_workflow() doesn't surface its experimental status programmatically

  • What: docs/concepts/workflow-chaining.md and the Fern page both lead with an "Experimental Feature" admonition, but in code there's no docstring warning, no log line, and no FutureWarning/PendingDeprecationWarning when constructing a workflow. Users coming from autocomplete will treat it like any other stable API.
  • Why: The PR description explicitly calls out that the workflow API, metadata schema, and artifact layout are subject to change. Signaling that in code reduces the chance someone builds production tooling on the metadata layout before it stabilizes.
  • Suggestion: At minimum, mention "experimental" in the compose_workflow() and CompositeWorkflow docstrings. Optionally, log a single INFO line ("🧪 Workflow chaining is experimental and the artifact layout may change.") on first use.

packages/data-designer/src/data_designer/interface/composite_workflow.py:217 — Reaches into the private _artifact_path on the parent DataDesigner

  • What: workflow_path = self._data_designer._artifact_path / self.name accesses a leading-underscore attribute from a sibling module.
  • Why: Same data_designer.interface package, so it's not crossing a layer boundary, but it makes refactoring the artifact-path representation on DataDesigner (e.g. lazy resolution, indirection) harder than it should be. The pattern elsewhere in the project is to expose a property.
  • Suggestion: Add a small @property artifact_path(self) -> Path on DataDesigner and use that here. (You can keep the private attribute as the storage backing.)

Suggestions — Take it or leave it

packages/data-designer/src/data_designer/interface/composite_workflow.py:173output: str is stringly-typed

  • What: The output parameter on add_stage() accepts "final" or "processor:<name>" as raw strings. Validation lives in _validate_stage_output / _validate_stage_output_processor, so typos like "processors:foo" will surface as a runtime error.
  • Why: For a feature with two well-defined modes, a small constructor pattern (e.g. StageOutput.final(), StageOutput.processor("name")) gives autocompletion, narrower types, and removes the string parsing. Since the API is experimental, the current shape is shippable, but worth revisiting before stabilization.
  • Suggestion: Optional now; flag it on the workflow-chaining plan as something to consider before exiting experimental status.

packages/data-designer/src/data_designer/interface/composite_workflow.py:512-514 — Workflow metadata write is non-atomic

  • What: _write_workflow_metadata calls path.write_text(...) directly. A crash mid-write would leave a truncated/corrupt JSON.
  • Why: Metadata is updated multiple times per stage (running → completed/failed). Each write opens up a small corruption window. Low impact because re-running the workflow rebuilds it, but if anyone parses partial metadata from disk asynchronously (e.g. UI), they may see an inconsistent state.
  • Suggestion: Write to a sibling temp file and os.replace(...) over the target. Not blocking.

packages/data-designer/src/data_designer/interface/composite_workflow.py:46OnSuccessCallback is sync-only

  • What: OnSuccessCallback = Callable[[Path], Path | str]. With acreate() landing in this PR, async callbacks will likely be desirable when the workflow grows an arun().
  • Why: Not actionable today (no arun() yet), but worth keeping in mind so the eventual API can either accept Awaitable[Path | str] or split into a separate AsyncOnSuccessCallback.

packages/data-designer/tests/interface/test_acreate.py:100-101 — 2-second threading.Event timeout may flake on slow CI

  • What: assert both_started.wait(2) and assert release.wait(2) use 2-second timeouts in test_acreate_does_not_serialize_create_calls.
  • Why: Generally fine locally, but cold container starts under heavy CI load can blow past 2s. The test is non-deterministic on the timing path.
  • Suggestion: Bump to ~5s. The successful path still returns immediately on set(), so the bump is only paid in the failure case.

packages/data-designer/src/data_designer/interface/composite_workflow.py:194-203output_processors are cloned twice

  • What: _clone_processors(output_processors) at add_stage() produces deep copies stored in _WorkflowStage, then _output_processor_config_builder does another processor.model_copy(deep=True) per processor when building the inner config.
  • Why: Defensive but redundant — by the time we get to _output_processor_config_builder, the processors are already isolated copies.
  • Suggestion: Drop the inner model_copy(deep=True) since the cloned tuple is already detached from the user's input. Minor.

What Looks Good

  • Strong test coverage: 43 new workflow tests plus the async ones, mixing fast unit-level tests (_patch_create) with real-engine integration tests (_real_data_designer). The split makes it easy to keep CI fast while still catching engine-level regressions.
  • Greptile's earlier flags are handled cleanly: _parquet_files unification across export()/load_dataset()/count_records() (with a mixed-batch regression test) and the seed-dataset-scoped validate_columns_not_all_dropped fix (with regressions for both the seeded-processor-only case and the generated-columns-dropped case) are well executed.
  • Defensive builder/processor cloning in add_stage() and again in run()test_composite_workflow_clones_stage_builders_on_add nails down the contract, which is exactly the kind of subtle bug that bites later.
  • Throttle-manager sharing across sequential stages and concurrent acreate() calls is the right design choice (rate limits should be account-level, not call-level) and test_data_designer_reuses_throttle_manager_across_create_calls plus test_run_config_setting_persists cover both the reuse and rebuild-on-set-run-config paths.
  • Error wrapping: DataDesignerWorkflowError is a proper DataDesignerError subclass, consistent with the project's "errors normalize at boundaries" rule.
  • Docs: The experimental admonition is clearly stated in both MkDocs and Fern pages, with a useful "current limits" section that mirrors the still-missing scope in the PR description.

Verdict

Ship it (with nits) — only the missing docstrings, the experimental-status surfacing, and the private-attribute access rise above optional polish, and all of them are quick follow-ups. None of them block the feature behavior, which already has strong test coverage and clean error handling.


This review was generated by an AI assistant.

nabinchha
nabinchha previously approved these changes May 18, 2026
Copy link
Copy Markdown
Contributor

@nabinchha nabinchha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — see #636 (comment) for the full review. Nits called out there are non-blocking.

@github-actions
Copy link
Copy Markdown
Contributor

MkDocs preview: https://837c4c0f.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-636.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

Comment on lines +557 to +565
def _validate_distinct_output_processors(
config_builder: DataDesignerConfigBuilder,
output_processors: list[ProcessorConfig],
) -> None:
stage_processor_names = {processor.name for processor in config_builder.get_processor_configs()}
duplicate_names = stage_processor_names.intersection(processor.name for processor in output_processors)
if duplicate_names:
names = ", ".join(sorted(duplicate_names))
raise DataDesignerWorkflowError(f"Output processor names must be distinct from stage processor names: {names}.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 _validate_distinct_output_processors checks for name collisions between the stage's inline processors and output_processors, but it does not check for duplicate names within output_processors itself. If two entries share the same name (e.g., two DropColumnsProcessorConfig(name="drop") instances), both would write to the same processors_outputs_path / "drop" directory, with the second run silently overwriting the first. Any call to load_stage_output() or load_processor_dataset() referencing that name would then return the output of whichever processor ran last, not the one the author intended.

Suggested change
def _validate_distinct_output_processors(
config_builder: DataDesignerConfigBuilder,
output_processors: list[ProcessorConfig],
) -> None:
stage_processor_names = {processor.name for processor in config_builder.get_processor_configs()}
duplicate_names = stage_processor_names.intersection(processor.name for processor in output_processors)
if duplicate_names:
names = ", ".join(sorted(duplicate_names))
raise DataDesignerWorkflowError(f"Output processor names must be distinct from stage processor names: {names}.")
def _validate_distinct_output_processors(
config_builder: DataDesignerConfigBuilder,
output_processors: list[ProcessorConfig],
) -> None:
stage_processor_names = {processor.name for processor in config_builder.get_processor_configs()}
duplicate_names = stage_processor_names.intersection(processor.name for processor in output_processors)
if duplicate_names:
names = ", ".join(sorted(duplicate_names))
raise DataDesignerWorkflowError(f"Output processor names must be distinct from stage processor names: {names}.")
output_processor_names = [processor.name for processor in output_processors]
seen: set[str] = set()
duplicates_within = {name for name in output_processor_names if name in seen or seen.add(name)} # type: ignore[func-returns-value]
if duplicates_within:
names = ", ".join(sorted(duplicates_within))
raise DataDesignerWorkflowError(f"Output processor names must be unique within output_processors: {names}.")
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/data-designer/src/data_designer/interface/composite_workflow.py
Line: 557-565

Comment:
`_validate_distinct_output_processors` checks for name collisions between the stage's inline processors and `output_processors`, but it does not check for duplicate names *within* `output_processors` itself. If two entries share the same name (e.g., two `DropColumnsProcessorConfig(name="drop")` instances), both would write to the same `processors_outputs_path / "drop"` directory, with the second run silently overwriting the first. Any call to `load_stage_output()` or `load_processor_dataset()` referencing that name would then return the output of whichever processor ran last, not the one the author intended.

```suggestion
def _validate_distinct_output_processors(
    config_builder: DataDesignerConfigBuilder,
    output_processors: list[ProcessorConfig],
) -> None:
    stage_processor_names = {processor.name for processor in config_builder.get_processor_configs()}
    duplicate_names = stage_processor_names.intersection(processor.name for processor in output_processors)
    if duplicate_names:
        names = ", ".join(sorted(duplicate_names))
        raise DataDesignerWorkflowError(f"Output processor names must be distinct from stage processor names: {names}.")
    output_processor_names = [processor.name for processor in output_processors]
    seen: set[str] = set()
    duplicates_within = {name for name in output_processor_names if name in seen or seen.add(name)}  # type: ignore[func-returns-value]
    if duplicates_within:
        names = ", ".join(sorted(duplicates_within))
        raise DataDesignerWorkflowError(f"Output processor names must be unique within output_processors: {names}.")
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, agreed this is worth tracking as a follow-up. Created #675 for the duplicate output_processors validation gap.

@andreatgretel andreatgretel merged commit 6055290 into main May 18, 2026
50 checks passed
@andreatgretel andreatgretel deleted the andreatgretel/feat/workflow-chaining branch May 20, 2026 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants