Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 107 additions & 0 deletions docs/concepts/workflow-chaining.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
# Workflow Chaining

!!! warning "Experimental Feature"
Workflow chaining is currently **experimental** and under active development. The documentation, examples, workflow API, metadata schema, and artifact layout are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions).

Workflow chaining lets you split a dataset build into named stages. Each stage runs a normal `DataDesigner.create()` call, writes its own artifact directory, and hands a selected parquet output to the next stage as a `LocalFileSeedSource`.

Use it when one generation step naturally depends on the cleaned or reshaped output of another step, especially when a processor-only stage is clearer than mixing all transformations into one config.

## Basic shape

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

drafts = (
dd.DataDesignerConfigBuilder(model_configs=[fast_model])
.with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet"))
.add_column(
name="chunk_summary",
column_type="llm_text",
model_alias="fast",
prompt="Summarize this passage:\n\n{{ text }}",
)
.add_column(
name="question",
column_type="llm_text",
model_alias="fast",
prompt="Write a question about this passage:\n\n{{ chunk_summary }}",
)
.add_column(
name="answer",
column_type="llm_text",
model_alias="fast",
prompt="Answer {{ question }} using this passage:\n\n{{ text }}",
)
)

chatml = dd.DataDesignerConfigBuilder().add_processor(
dd.SchemaTransformProcessorConfig(
name="chatml",
template={
"messages": [
{"role": "user", "content": "{{ question }}"},
{"role": "assistant", "content": "{{ answer }}"},
],
},
)
)

workflow = data_designer.compose_workflow(name="doc-qa")
workflow.add_stage(
"drafts",
drafts,
num_records=100,
output_processors=[
dd.DropColumnsProcessorConfig(
name="drop_scratch",
column_names=["text", "chunk_summary"],
)
],
)
workflow.add_stage("chatml", chatml, output="processor:chatml")

results = workflow.run()
training_rows = results.load_dataset()
results.export("chatml.jsonl")
```

## Stage outputs

A stage can expose different views of its data:

| Surface | What it returns |
|---------|-----------------|
| `results["stage_name"]` | The effective `DatasetCreationResults` for that stage. If the stage uses `output_processors`, this points at the output-processor run. |
| `results.load_stage_output("stage_name")` | The selected output handed to downstream stages. This follows `output="processor:<name>"` and `on_success`. |
| `results.load_dataset()` | The selected output from the final stage. |

Processors added with `config_builder.add_processor(...)` run inside the stage and usually create side artifacts. They do not automatically change what the next stage receives. Use `output_processors=[...]` when a processor should define the stage boundary output.

## Processor-only stages

Stages can be processor-only when they receive seed data from an upstream stage:

```python
cleanup = dd.DataDesignerConfigBuilder().add_processor(
dd.DropColumnsProcessorConfig(
name="drop_private_fields",
column_names=["email", "raw_notes"],
)
)

workflow.add_stage("cleanup", cleanup)
```

This is useful for final cleanup, schema transforms, and format-specific export preparation.

## Current limits

- Stages are linear. DAGs, parallel branches, and joins are planned separately.
- Stage-level resume is not implemented yet.
- `push_to_hub()` does not support selected processor or callback outputs yet. Use `export()` for the selected workflow output.
- `on_success` callbacks are trusted user code. If a callback returns a path, Data Designer reads that path as the next stage input.
- The artifact layout is intended for inspection, but it is not yet a stable public contract.
2 changes: 2 additions & 0 deletions fern/versions/latest.yml
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,8 @@ navigation:
path: ./latest/pages/concepts/validators.mdx
- page: Processors
path: ./latest/pages/concepts/processors.mdx
- page: Workflow Chaining
path: ./latest/pages/concepts/workflow-chaining.mdx
- page: Person Sampling
path: ./latest/pages/concepts/person_sampling.mdx
- page: Traces
Expand Down
111 changes: 111 additions & 0 deletions fern/versions/latest/pages/concepts/workflow-chaining.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
---
title: "Workflow Chaining"
description: ""
position: 8
---
<Warning title="Experimental Feature">
Workflow chaining is currently **experimental** and under active development. The documentation, examples, workflow API, metadata schema, and artifact layout are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions).
</Warning>

Workflow chaining lets you split a dataset build into named stages. Each stage runs a normal `DataDesigner.create()` call, writes its own artifact directory, and hands a selected parquet output to the next stage as a `LocalFileSeedSource`.

Use it when one generation step naturally depends on the cleaned or reshaped output of another step, especially when a processor-only stage is clearer than mixing all transformations into one config.

## Basic shape

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

drafts = (
dd.DataDesignerConfigBuilder(model_configs=[fast_model])
.with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet"))
.add_column(
name="chunk_summary",
column_type="llm_text",
model_alias="fast",
prompt="Summarize this passage:\n\n{{ text }}",
)
.add_column(
name="question",
column_type="llm_text",
model_alias="fast",
prompt="Write a question about this passage:\n\n{{ chunk_summary }}",
)
.add_column(
name="answer",
column_type="llm_text",
model_alias="fast",
prompt="Answer {{ question }} using this passage:\n\n{{ text }}",
)
)

chatml = dd.DataDesignerConfigBuilder().add_processor(
dd.SchemaTransformProcessorConfig(
name="chatml",
template={
"messages": [
{"role": "user", "content": "{{ question }}"},
{"role": "assistant", "content": "{{ answer }}"},
],
},
)
)

workflow = data_designer.compose_workflow(name="doc-qa")
workflow.add_stage(
"drafts",
drafts,
num_records=100,
output_processors=[
dd.DropColumnsProcessorConfig(
name="drop_scratch",
column_names=["text", "chunk_summary"],
)
],
)
workflow.add_stage("chatml", chatml, output="processor:chatml")

results = workflow.run()
training_rows = results.load_dataset()
results.export("chatml.jsonl")
```

## Stage outputs

A stage can expose different views of its data:

| Surface | What it returns |
|---------|-----------------|
| `results["stage_name"]` | The effective `DatasetCreationResults` for that stage. If the stage uses `output_processors`, this points at the output-processor run. |
| `results.load_stage_output("stage_name")` | The selected output handed to downstream stages. This follows `output="processor:<name>"` and `on_success`. |
| `results.load_dataset()` | The selected output from the final stage. |

Processors added with `config_builder.add_processor(...)` run inside the stage and usually create side artifacts. They do not automatically change what the next stage receives. Use `output_processors=[...]` when a processor should define the stage boundary output.

## Processor-only stages

Stages can be processor-only when they receive seed data from an upstream stage:

```python
cleanup = dd.DataDesignerConfigBuilder().add_processor(
dd.DropColumnsProcessorConfig(
name="drop_private_fields",
column_names=["email", "raw_notes"],
)
)

workflow.add_stage("cleanup", cleanup)
```

This is useful for final cleanup, schema transforms, and format-specific export preparation.

## Current limits

- Stages are linear. DAGs, parallel branches, and joins are planned separately.
- Stage-level resume is not implemented yet.
- `push_to_hub()` does not support selected processor or callback outputs yet. Use `export()` for the selected workflow output.
- `on_success` callbacks are trusted user code. If a callback returns a path, Data Designer reads that path as the next stage input.
- The artifact layout is intended for inspection, but it is not yet a stable public contract.
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ nav:
- Custom Columns: concepts/custom_columns.md
- Validators: concepts/validators.md
- Processors: concepts/processors.md
- Workflow Chaining: concepts/workflow-chaining.md
- Person Sampling: concepts/person_sampling.md
- Traces: concepts/traces.md
- Tool Use & MCP:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ class DataDesignerConfig(ExportableConfigBase):

Attributes:
columns: Required list of column configurations defining how each column
should be generated. Must contain at least one column.
should be generated. May be empty for seeded processor-only configs.
model_configs: Optional list of model configurations for LLM-based generation.
Each model config defines the model, provider, and inference parameters.
tool_configs: Optional list of tool configurations for MCP tool calling.
Expand All @@ -39,7 +39,7 @@ class DataDesignerConfig(ExportableConfigBase):
processors: Optional list of processor configurations for post-generation transformations.
"""

columns: list[Annotated[ColumnConfigT, Field(discriminator="column_type")]] = Field(min_length=1)
columns: list[Annotated[ColumnConfigT, Field(discriminator="column_type")]]
model_configs: list[ModelConfig] | None = None
tool_configs: list[ToolConfig] | None = None
seed_config: SeedConfig | None = None
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
from data_designer.config.analysis.dataset_profiler import DatasetProfilerResults
from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.config.dataset_metadata import DatasetMetadata
from data_designer.config.seed_source_dataframe import DataFrameSeedSource
from data_designer.config.utils.visualization import WithRecordSamplerMixin

if TYPE_CHECKING:
Expand Down Expand Up @@ -41,3 +42,18 @@ def __init__(
self.dataset_metadata: DatasetMetadata | None = dataset_metadata
self.task_traces: list[Any] | None = task_traces
self._config_builder = config_builder

def to_config_builder(self, columns: list[str] | None = None) -> DataDesignerConfigBuilder:
"""Create a new config builder seeded from this preview dataset.

Copies the full preview dataset in memory; intended for interactive use.
"""
if self.dataset is None:
raise ValueError("Preview dataset is not available.")
df = self.dataset
if columns is not None:
df = df.loc[:, columns]
return DataDesignerConfigBuilder(
model_configs=self._config_builder.model_configs,
tool_configs=self._config_builder.tool_configs,
).with_seed_dataset(DataFrameSeedSource(df=df.copy()))
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,10 @@ def __init__(
def artifact_storage(self) -> ArtifactStorage:
return self._resource_provider.artifact_storage

@property
def data_designer_config(self) -> DataDesignerConfig:
return self._data_designer_config

@property
def processors(self) -> tuple[Processor, ...]:
return self._processor_runner.processors
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
if TYPE_CHECKING:
from data_designer.config.run_config import RunConfig
from data_designer.engine.mcp.registry import MCPRegistry
from data_designer.engine.models.clients.throttle_manager import ThrottleManager
from data_designer.engine.models.registry import ModelRegistry


Expand All @@ -24,6 +25,7 @@ def create_model_registry(
mcp_registry: MCPRegistry | None = None,
client_concurrency_mode: ClientConcurrencyMode = ClientConcurrencyMode.SYNC,
run_config: RunConfig | None = None,
throttle_manager: ThrottleManager | None = None,
) -> ModelRegistry:
"""Factory function for creating a ModelRegistry instance.

Expand All @@ -43,6 +45,8 @@ def create_model_registry(
run_config: Optional runtime configuration. The nested
``run_config.throttle`` (a ``ThrottleConfig``) is forwarded to the
``ThrottleManager`` constructor.
throttle_manager: Optional shared throttle manager. When omitted, a new
manager is created for this registry.

Returns:
A configured ModelRegistry instance.
Expand All @@ -54,7 +58,8 @@ def create_model_registry(
from data_designer.engine.models.facade import ModelFacade
from data_designer.engine.models.registry import ModelRegistry

throttle_manager = ThrottleManager((run_config or RunConfig()).throttle)
if throttle_manager is None:
throttle_manager = ThrottleManager((run_config or RunConfig()).throttle)

def model_facade_factory(
model_config: ModelConfig,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
from __future__ import annotations

import os
from typing import TYPE_CHECKING

from data_designer.config.base import ConfigBase
from data_designer.config.dataset_metadata import DatasetMetadata
Expand All @@ -26,6 +27,9 @@
from data_designer.engine.secret_resolver import SecretResolver
from data_designer.engine.storage.artifact_storage import ArtifactStorage

if TYPE_CHECKING:
from data_designer.engine.models.clients.throttle_manager import ThrottleManager


class ResourceType(StrEnum):
PERSON_READER = "person_reader"
Expand Down Expand Up @@ -91,6 +95,7 @@ def create_resource_provider(
mcp_providers: list[MCPProviderT] | None = None,
tool_configs: list[ToolConfig] | None = None,
client_concurrency_mode: ClientConcurrencyMode | None = None,
throttle_manager: ThrottleManager | None = None,
) -> ResourceProvider:
"""Factory function for creating a ResourceProvider instance.

Expand All @@ -111,6 +116,7 @@ def create_resource_provider(
run_config: Optional runtime configuration.
mcp_providers: Optional list of MCP provider configurations.
tool_configs: Optional list of tool configurations.
throttle_manager: Optional shared throttle manager for model clients.

Returns:
A configured ResourceProvider instance.
Expand Down Expand Up @@ -158,6 +164,7 @@ def create_resource_provider(
mcp_registry=mcp_registry,
client_concurrency_mode=client_concurrency_mode,
run_config=effective_run_config,
throttle_manager=throttle_manager,
),
person_reader=person_reader,
mcp_registry=mcp_registry,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import logging
from abc import ABC, abstractmethod
from collections.abc import Callable, Iterable, Sequence
from copy import copy
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import Path, PurePosixPath
Expand Down Expand Up @@ -673,7 +674,9 @@ def add_reader(self, reader: SeedReader) -> Self:
return self

def get_reader(self, seed_dataset_source: SeedSource, secret_resolver: SecretResolver) -> SeedReader:
reader = self._get_reader_for_source(seed_dataset_source)
# attach() mutates top-level source/resolver state. Reader subclasses must
# not keep nested mutable state shared across attaches.
reader = copy(self._get_reader_for_source(seed_dataset_source))
reader.attach(seed_dataset_source, secret_resolver)
return reader

Expand Down
Loading
Loading