NVIDIA-NeMo · andreatgretel · May 18, 2026 · May 12, 2026 · May 12, 2026 · May 12, 2026
@@ -0,0 +1,107 @@
+# Workflow Chaining
+
+!!! warning "Experimental Feature"
+    Workflow chaining is currently **experimental** and under active development. The documentation, examples, workflow API, metadata schema, and artifact layout are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions).
+
+Workflow chaining lets you split a dataset build into named stages. Each stage runs a normal `DataDesigner.create()` call, writes to its own artifact directory, and hands a selected parquet output to the next stage as a `LocalFileSeedSource`.
+
+Use it when one generation step naturally depends on the cleaned or reshaped output of another step, especially when a processor-only stage is clearer than mixing all transformations into one config.
+
+## Basic shape
+
+```python
+import data_designer.config as dd
+from data_designer.interface import DataDesigner
+
+data_designer = DataDesigner()
+
+drafts = (
+    dd.DataDesignerConfigBuilder(model_configs=[fast_model])  # fast_model: a ModelConfig with alias "fast", defined elsewhere
+    .with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet"))
+    .add_column(
+        name="chunk_summary",
+        column_type="llm_text",
+        model_alias="fast",
+        prompt="Summarize this passage:\n\n{{ text }}",
+    )
+    .add_column(
+        name="question",
+        column_type="llm_text",
+        model_alias="fast",
+        prompt="Write a question about this passage:\n\n{{ chunk_summary }}",
+    )
+    .add_column(
+        name="answer",
+        column_type="llm_text",
+        model_alias="fast",
+        prompt="Answer {{ question }} using this passage:\n\n{{ text }}",
+    )
+)
+
+chatml = dd.DataDesignerConfigBuilder().add_processor(
+    dd.SchemaTransformProcessorConfig(
+        name="chatml",
+        template={
+            "messages": [
+                {"role": "user", "content": "{{ question }}"},
+                {"role": "assistant", "content": "{{ answer }}"},
+            ],
+        },
+    )
+)
+
+workflow = data_designer.compose_workflow(name="doc-qa")
+workflow.add_stage(
+    "drafts",
+    drafts,
+    num_records=100,
+    output_processors=[
+        dd.DropColumnsProcessorConfig(
+            name="drop_scratch",
+            column_names=["text", "chunk_summary"],
+        )
+    ],
+)
+workflow.add_stage("chatml", chatml, output="processor:chatml")
+
+results = workflow.run()
+training_rows = results.load_dataset()
+results.export("chatml.jsonl")
+```
+
+## Stage outputs
+
+A stage can expose different views of its data:
+
+| Surface | What it returns |
+|---------|-----------------|
+| `results["stage_name"]` | The effective `DatasetCreationResults` for that stage. If the stage uses `output_processors`, this points at the output-processor run. |
+| `results.load_stage_output("stage_name")` | The selected output handed to downstream stages. This respects the stage's `output="processor:<name>"` setting and any `on_success` callback. |
+| `results.load_dataset()` | The selected output from the final stage. |
+
+Processors added with `config_builder.add_processor(...)` run inside the stage and usually create side artifacts. They do not automatically change what the next stage receives. Use `output_processors=[...]` when a processor should define the stage boundary output.
+
+## Processor-only stages
+
+Stages can be processor-only when they receive seed data from an upstream stage:
+
+```python
+cleanup = dd.DataDesignerConfigBuilder().add_processor(
+    dd.DropColumnsProcessorConfig(
+        name="drop_private_fields",
+        column_names=["email", "raw_notes"],
+    )
+)
+
+workflow.add_stage("cleanup", cleanup)
+```
+
+This is useful for final cleanup, schema transforms, and format-specific export preparation.
+
+## Current limits
+
+- Stages run in a single linear sequence. DAGs, parallel branches, and joins are not supported yet and are planned as separate work.
+- Stage-level resume is not implemented yet.
+- `push_to_hub()` does not support selected processor or callback outputs yet. Use `export()` for the selected workflow output.
+- `on_success` callbacks are trusted user code. If a callback returns a path, Data Designer reads that path as the next stage input.
+- The artifact layout is intended for inspection, but it is not yet a stable public contract.
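
Reviewer note: the hand-off described in this page (each stage writes to its own directory and its selected output seeds the next stage) can be pictured with a toy, library-free sketch. Everything below is hypothetical illustration. `run_linear`, `write_rows`, `read_rows`, and the JSON Lines files are assumptions made for the example and are not Data Designer's implementation, which uses parquet and `LocalFileSeedSource`.

```python
import json
import tempfile
from pathlib import Path


def write_rows(path, rows):
    # Persist rows as JSON Lines, creating the stage directory if needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in rows))


def read_rows(path):
    return [json.loads(line) for line in path.read_text().splitlines()]


def run_linear(stages, seed_path, root):
    """Run (name, fn) stages in order; each output file seeds the next stage."""
    current = seed_path
    for name, fn in stages:
        out = root / name / "output.jsonl"  # one artifact directory per stage
        write_rows(out, fn(read_rows(current)))
        current = out  # hand-off: this file is the next stage's input
    return current


def drafts(rows):
    # Generation-like stage: adds a column derived from the seed text.
    return [{**r, "question": f"What is {r['text']}?"} for r in rows]


def cleanup(rows):
    # Processor-only stage: drops a scratch column, generates nothing.
    return [{k: v for k, v in r.items() if k != "text"} for r in rows]


root = Path(tempfile.mkdtemp())
seed = root / "seed.jsonl"
write_rows(seed, [{"text": "parquet"}])
final = run_linear([("drafts", drafts), ("cleanup", cleanup)], seed, root)
final_rows = read_rows(final)
print(final_rows)
```

The `cleanup` function plays the role of a processor-only stage: it only reshapes what the upstream stage produced, which is the case the docs recommend splitting out.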
@@ -35,6 +35,8 @@ navigation:
         path: ./latest/pages/concepts/validators.mdx
       - page: Processors
         path: ./latest/pages/concepts/processors.mdx
+      - page: Workflow Chaining
+        path: ./latest/pages/concepts/workflow-chaining.mdx
       - page: Person Sampling
         path: ./latest/pages/concepts/person_sampling.mdx
       - page: Traces

@@ -0,0 +1,111 @@
+---
+title: "Workflow Chaining"
+description: "Split a dataset build into named stages that hand a selected output to the next stage."
+position: 8
+---
+<Warning title="Experimental Feature">
+  Workflow chaining is currently **experimental** and under active development. The documentation, examples, workflow API, metadata schema, and artifact layout are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions).
+</Warning>
+
+Workflow chaining lets you split a dataset build into named stages. Each stage runs a normal `DataDesigner.create()` call, writes to its own artifact directory, and hands a selected parquet output to the next stage as a `LocalFileSeedSource`.
+
+Use it when one generation step naturally depends on the cleaned or reshaped output of another step, especially when a processor-only stage is clearer than mixing all transformations into one config.
+
+## Basic shape
+
+```python
+import data_designer.config as dd
+from data_designer.interface import DataDesigner
+
+data_designer = DataDesigner()
+
+drafts = (
+    dd.DataDesignerConfigBuilder(model_configs=[fast_model])  # fast_model: a ModelConfig with alias "fast", defined elsewhere
+    .with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet"))
+    .add_column(
+        name="chunk_summary",
+        column_type="llm_text",
+        model_alias="fast",
+        prompt="Summarize this passage:\n\n{{ text }}",
+    )
+    .add_column(
+        name="question",
+        column_type="llm_text",
+        model_alias="fast",
+        prompt="Write a question about this passage:\n\n{{ chunk_summary }}",
+    )
+    .add_column(
+        name="answer",
+        column_type="llm_text",
+        model_alias="fast",
+        prompt="Answer {{ question }} using this passage:\n\n{{ text }}",
+    )
+)
+
+chatml = dd.DataDesignerConfigBuilder().add_processor(
+    dd.SchemaTransformProcessorConfig(
+        name="chatml",
+        template={
+            "messages": [
+                {"role": "user", "content": "{{ question }}"},
+                {"role": "assistant", "content": "{{ answer }}"},
+            ],
+        },
+    )
+)
+
+workflow = data_designer.compose_workflow(name="doc-qa")
+workflow.add_stage(
+    "drafts",
+    drafts,
+    num_records=100,
+    output_processors=[
+        dd.DropColumnsProcessorConfig(
+            name="drop_scratch",
+            column_names=["text", "chunk_summary"],
+        )
+    ],
+)
+workflow.add_stage("chatml", chatml, output="processor:chatml")
+
+results = workflow.run()
+training_rows = results.load_dataset()
+results.export("chatml.jsonl")
+```
+
+## Stage outputs
+
+A stage can expose different views of its data:
+
+| Surface | What it returns |
+|---------|-----------------|
+| `results["stage_name"]` | The effective `DatasetCreationResults` for that stage. If the stage uses `output_processors`, this points at the output-processor run. |
+| `results.load_stage_output("stage_name")` | The selected output handed to downstream stages. This respects the stage's `output="processor:<name>"` setting and any `on_success` callback. |
+| `results.load_dataset()` | The selected output from the final stage. |
+
+Processors added with `config_builder.add_processor(...)` run inside the stage and usually create side artifacts. They do not automatically change what the next stage receives. Use `output_processors=[...]` when a processor should define the stage boundary output.
+
+## Processor-only stages
+
+Stages can be processor-only when they receive seed data from an upstream stage:
+
+```python
+cleanup = dd.DataDesignerConfigBuilder().add_processor(
+    dd.DropColumnsProcessorConfig(
+        name="drop_private_fields",
+        column_names=["email", "raw_notes"],
+    )
+)
+
+workflow.add_stage("cleanup", cleanup)
+```
+
+This is useful for final cleanup, schema transforms, and format-specific export preparation.
+
+## Current limits
+
+- Stages run in a single linear sequence. DAGs, parallel branches, and joins are not supported yet and are planned as separate work.
+- Stage-level resume is not implemented yet.
+- `push_to_hub()` does not support selected processor or callback outputs yet. Use `export()` for the selected workflow output.
+- `on_success` callbacks are trusted user code. If a callback returns a path, Data Designer reads that path as the next stage input.
+- The artifact layout is intended for inspection, but it is not yet a stable public contract.
@@ -20,6 +20,7 @@ nav:
       - Custom Columns: concepts/custom_columns.md
       - Validators: concepts/validators.md
       - Processors: concepts/processors.md
+      - Workflow Chaining: concepts/workflow-chaining.md
       - Person Sampling: concepts/person_sampling.md
       - Traces: concepts/traces.md
       - Tool Use & MCP:

@@ -28,7 +28,7 @@ class DataDesignerConfig(ExportableConfigBase):
 
     Attributes:
         columns: Required list of column configurations defining how each column
-            should be generated. Must contain at least one column.
+            should be generated. The field itself is required, but the list may be empty for seeded processor-only configs.
         model_configs: Optional list of model configurations for LLM-based generation.
             Each model config defines the model, provider, and inference parameters.
         tool_configs: Optional list of tool configurations for MCP tool calling.
@@ -39,7 +39,7 @@ class DataDesignerConfig(ExportableConfigBase):
         processors: Optional list of processor configurations for post-generation transformations.
     """
 
-    columns: list[Annotated[ColumnConfigT, Field(discriminator="column_type")]] = Field(min_length=1)
+    columns: list[Annotated[ColumnConfigT, Field(discriminator="column_type")]]
     model_configs: list[ModelConfig] | None = None
     tool_configs: list[ToolConfig] | None = None
     seed_config: SeedConfig | None = None

@@ -8,6 +8,7 @@
 from data_designer.config.analysis.dataset_profiler import DatasetProfilerResults
 from data_designer.config.config_builder import DataDesignerConfigBuilder
 from data_designer.config.dataset_metadata import DatasetMetadata
+from data_designer.config.seed_source_dataframe import DataFrameSeedSource
 from data_designer.config.utils.visualization import WithRecordSamplerMixin
 
 if TYPE_CHECKING:
@@ -41,3 +42,18 @@ def __init__(
         self.dataset_metadata: DatasetMetadata | None = dataset_metadata
         self.task_traces: list[Any] | None = task_traces
         self._config_builder = config_builder
+
+    def to_config_builder(self, columns: list[str] | None = None) -> DataDesignerConfigBuilder:
+        """Create a new config builder seeded from this preview dataset.
+
+        Copies the preview dataset (or only ``columns``, when given) in memory; intended for interactive use.
+        """
+        if self.dataset is None:
+            raise ValueError("Preview dataset is not available.")
+        df = self.dataset
+        if columns is not None:
+            df = df.loc[:, columns]
+        return DataDesignerConfigBuilder(
+            model_configs=self._config_builder.model_configs,
+            tool_configs=self._config_builder.tool_configs,
+        ).with_seed_dataset(DataFrameSeedSource(df=df.copy()))
@@ -163,6 +163,10 @@ def __init__(
     def artifact_storage(self) -> ArtifactStorage:
         return self._resource_provider.artifact_storage
 
+    @property
+    def data_designer_config(self) -> DataDesignerConfig:
+        return self._data_designer_config
+
     @property
     def processors(self) -> tuple[Processor, ...]:
         return self._processor_runner.processors

@@ -13,6 +13,7 @@
 if TYPE_CHECKING:
     from data_designer.config.run_config import RunConfig
     from data_designer.engine.mcp.registry import MCPRegistry
+    from data_designer.engine.models.clients.throttle_manager import ThrottleManager
     from data_designer.engine.models.registry import ModelRegistry
 
 
@@ -24,6 +25,7 @@ def create_model_registry(
     mcp_registry: MCPRegistry | None = None,
     client_concurrency_mode: ClientConcurrencyMode = ClientConcurrencyMode.SYNC,
     run_config: RunConfig | None = None,
+    throttle_manager: ThrottleManager | None = None,
 ) -> ModelRegistry:
     """Factory function for creating a ModelRegistry instance.
 
@@ -43,6 +45,8 @@ def create_model_registry(
         run_config: Optional runtime configuration.  The nested
             ``run_config.throttle`` (a ``ThrottleConfig``) is forwarded to the
             ``ThrottleManager`` constructor.
+        throttle_manager: Optional shared throttle manager. When omitted, a new
+            manager is created for this registry.
 
     Returns:
         A configured ModelRegistry instance.
@@ -54,7 +58,8 @@ def create_model_registry(
     from data_designer.engine.models.facade import ModelFacade
     from data_designer.engine.models.registry import ModelRegistry
 
-    throttle_manager = ThrottleManager((run_config or RunConfig()).throttle)
+    if throttle_manager is None:
+        throttle_manager = ThrottleManager((run_config or RunConfig()).throttle)
 
     def model_facade_factory(
         model_config: ModelConfig,

@@ -4,6 +4,7 @@
 from __future__ import annotations
 
 import os
+from typing import TYPE_CHECKING
 
 from data_designer.config.base import ConfigBase
 from data_designer.config.dataset_metadata import DatasetMetadata
@@ -26,6 +27,9 @@
 from data_designer.engine.secret_resolver import SecretResolver
 from data_designer.engine.storage.artifact_storage import ArtifactStorage
 
+if TYPE_CHECKING:
+    from data_designer.engine.models.clients.throttle_manager import ThrottleManager
+
 
 class ResourceType(StrEnum):
     PERSON_READER = "person_reader"
@@ -91,6 +95,7 @@ def create_resource_provider(
     mcp_providers: list[MCPProviderT] | None = None,
     tool_configs: list[ToolConfig] | None = None,
     client_concurrency_mode: ClientConcurrencyMode | None = None,
+    throttle_manager: ThrottleManager | None = None,
 ) -> ResourceProvider:
     """Factory function for creating a ResourceProvider instance.
 
@@ -111,6 +116,7 @@ def create_resource_provider(
         run_config: Optional runtime configuration.
         mcp_providers: Optional list of MCP provider configurations.
         tool_configs: Optional list of tool configurations.
+        throttle_manager: Optional shared throttle manager for model clients.
 
     Returns:
         A configured ResourceProvider instance.
@@ -158,6 +164,7 @@ def create_resource_provider(
             mcp_registry=mcp_registry,
             client_concurrency_mode=client_concurrency_mode,
             run_config=effective_run_config,
+            throttle_manager=throttle_manager,
         ),
         person_reader=person_reader,
         mcp_registry=mcp_registry,
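
Reviewer note: the two hunks above thread an optional `throttle_manager` through `create_resource_provider` into `create_model_registry`, which only builds its own manager when none is passed. A minimal sketch of that "inject or create" pattern follows. `ToyThrottleManager`, `ToyRegistry`, and `create_toy_registry` are hypothetical stand-ins, and the idea that several runs would share one manager to respect a common rate limit is an assumption about the motivation, not something the diff states.

```python
from __future__ import annotations


class ToyThrottleManager:
    """Hypothetical stand-in that just counts the requests it has admitted."""

    def __init__(self) -> None:
        self.admitted = 0

    def admit(self) -> None:
        self.admitted += 1


class ToyRegistry:
    def __init__(self, throttle_manager: ToyThrottleManager) -> None:
        self.throttle_manager = throttle_manager


def create_toy_registry(throttle_manager: ToyThrottleManager | None = None) -> ToyRegistry:
    # Same shape as the diff: build a private manager only when none is passed.
    if throttle_manager is None:
        throttle_manager = ToyThrottleManager()
    return ToyRegistry(throttle_manager)


shared = ToyThrottleManager()
stage_one = create_toy_registry(throttle_manager=shared)
stage_two = create_toy_registry(throttle_manager=shared)
standalone = create_toy_registry()

stage_one.throttle_manager.admit()
stage_two.throttle_manager.admit()
print(shared.admitted)  # both registries counted against the same manager
print(standalone.throttle_manager is shared)
```

Defaulting to `None` keeps existing callers unchanged, since they still get a fresh manager per registry.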

@@ -6,6 +6,7 @@
 import logging
 from abc import ABC, abstractmethod
 from collections.abc import Callable, Iterable, Sequence
+from copy import copy
 from dataclasses import dataclass
 from fnmatch import fnmatchcase
 from pathlib import Path, PurePosixPath
@@ -673,7 +674,9 @@ def add_reader(self, reader: SeedReader) -> Self:
         return self
 
     def get_reader(self, seed_dataset_source: SeedSource, secret_resolver: SecretResolver) -> SeedReader:
-        reader = self._get_reader_for_source(seed_dataset_source)
+        # attach() mutates top-level source/resolver state, so each caller gets a shallow copy.
+        # The copy still shares nested mutable state, which reader subclasses must not keep.
+        reader = copy(self._get_reader_for_source(seed_dataset_source))
         reader.attach(seed_dataset_source, secret_resolver)
         return reader
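
Reviewer note: the comment in `get_reader` relies on how `copy.copy` behaves, so a small standalone sketch may help. `ToyReader` is a hypothetical stand-in, not the real `SeedReader`; it only shows that a shallow copy isolates attributes that `attach()` rebinds, while nested mutable objects remain shared between copies.

```python
from copy import copy


class ToyReader:
    """Hypothetical stand-in for a seed reader; not the Data Designer class."""

    def __init__(self) -> None:
        self.source = None  # top-level state, rebound by attach()
        self.cache = {}     # nested mutable state

    def attach(self, source: str) -> None:
        # Rebinding an attribute only affects the instance it is called on.
        self.source = source


registered = ToyReader()

first = copy(registered)
first.attach("a.parquet")
second = copy(registered)
second.attach("b.parquet")

# Top-level state is isolated per copy, and the registered reader is untouched.
print(first.source, second.source, registered.source)

# Nested mutable objects are shared by a shallow copy.
first.cache["rows"] = 10
print(second.cache is first.cache)
```

This is why the comment asks reader subclasses not to keep nested mutable state: a write through one attached copy would be visible through every other copy and through the registered reader.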