Consolidate Data Designer user configuration into a single TOML file #694

@eric-tramel

Priority Level

Medium (Nice to have)

Is your feature request related to a problem? Please describe.

Data Designer's user-level configuration under DATA_DESIGNER_HOME (~/.data-designer by default) is currently split across several YAML files:

  • model_configs.yaml
  • model_providers.yaml
  • mcp_providers.yaml
  • tool_configs.yaml
  • plugin_catalogs.yaml for CLI plugin catalogs
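
As a rough illustration of the current layout, the following sketch resolves the config root and reports which of the legacy files are present. The helper names (resolve_home, find_legacy_files) and the LEGACY_FILES constant are hypothetical and not taken from the Data Designer codebase; only the environment variable, the default directory, and the file names come from this issue.

```python
import os
from pathlib import Path

# File names as listed in this issue; the constant name is illustrative only.
LEGACY_FILES = (
    "model_configs.yaml",
    "model_providers.yaml",
    "mcp_providers.yaml",
    "tool_configs.yaml",
    "plugin_catalogs.yaml",
)


def resolve_home() -> Path:
    """Return DATA_DESIGNER_HOME if set, otherwise ~/.data-designer."""
    override = os.environ.get("DATA_DESIGNER_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".data-designer"


def find_legacy_files(home: Path) -> list[Path]:
    """List the legacy YAML files that actually exist under the config root."""
    return [home / name for name in LEGACY_FILES if (home / name).is_file()]
```

Each file found this way currently needs its own repository, list/reset handling, and documentation.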

Issue #559 would add another persisted runtime configuration file, run_config.yaml, for system-level RunConfig defaults.

That layout was reasonable when the config surface was small, but every new concern now adds another file, repository, list/reset path, migration path, and documentation branch. Users who want to inspect, back up, review, or share their local Data Designer settings have to understand a directory of loosely related files instead of one canonical configuration document.

The fragmentation also makes feature design harder. #559 is a good example: runtime defaults clearly belong in user/system-level configuration, but adding yet another YAML file would deepen the current split.

Describe the solution you'd like

Introduce a single canonical TOML file for Data Designer user/system configuration, for example:

~/.data-designer/config.toml

or:

~/.data-designer/data-designer.toml

The exact filename can be decided during implementation, but the important design goal is one typed, versioned, human-editable file for Data Designer's local configuration.

Suggested shape:

version = 1

[[model.providers]]
name = "openai"
base_url = "https://api.openai.com/v1"
api_key_env = "OPENAI_API_KEY"

[[model.configs]]
alias = "openai-text"
provider = "openai"
model = "gpt-4.1"

[model.configs.inference_parameters]
generation_type = "chat-completion"
temperature = 0.85
top_p = 0.95

[[mcp.providers]]
name = "local-tools"
type = "stdio"

[[tools.configs]]
tool_alias = "default"
providers = ["local-tools"]

[[plugins.catalogs]]
alias = "nvidia"
url = "https://nvidia-nemo.github.io/DataDesignerPlugins/catalog/plugins.json"

[run]
buffer_size = 1000
disable_early_shutdown = false
progress_bar = false
progress_interval = 5.0

[run.throttle]
reduce_factor = 0.75
additive_increase = 1
success_window = 25
cooldown_seconds = 2.0
ceiling_overshoot = 0.10
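
To make the table layout concrete, this sketch parses a trimmed copy of the suggested shape with the standard-library tomllib (Python 3.11+; on 3.10 it assumes the tomli backport is installed). It shows that each [[...]] header appends an entry to a list, and that [model.configs.inference_parameters] attaches to the most recent [[model.configs]] entry. The variable names are illustrative only.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # Python 3.10: assumes the tomli backport is installed
    import tomli as tomllib

SAMPLE = """
version = 1

[[model.providers]]
name = "openai"
api_key_env = "OPENAI_API_KEY"

[[model.configs]]
alias = "openai-text"
provider = "openai"

[model.configs.inference_parameters]
temperature = 0.85

[run]
buffer_size = 1000

[run.throttle]
reduce_factor = 0.75
"""

config = tomllib.loads(SAMPLE)

# Arrays of tables become lists of dicts; plain tables become nested dicts.
first_model = config["model"]["configs"][0]
print(first_model["inference_parameters"]["temperature"])
print(config["run"]["throttle"]["reduce_factor"])
```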

Desired behavior:

  • DATA_DESIGNER_HOME remains the config root override.
  • The single TOML file becomes the canonical write target for user/system-level config.
  • Existing YAML files remain readable for a migration window.
  • When both the TOML file and legacy YAML files exist, precedence should be explicit and documented. A reasonable default is: TOML wins, legacy files are fallback only.
  • Add a migration command or automatic prompted migration, such as data-designer config migrate, to merge existing YAML files into the TOML file.
  • data-designer config list should show the unified file path and all loaded sections.
  • data-designer config reset should operate on the unified file, with section-level reset support if practical.
  • Invalid TOML or invalid section schemas should fail loudly with file and section context.
  • Dataset/workflow config files should remain separate from this proposal. This issue is about user/system configuration under ~/.data-designer, not replacing normal dataset YAML/JSON configs.
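
The "TOML wins, legacy files are fallback only" rule could be sketched as follows. This is a minimal sketch that assumes precedence is decided per top-level section and that both sources have already been parsed into dictionaries; the function name and the per-section granularity are assumptions, not decisions made in this issue.

```python
from __future__ import annotations

from typing import Any, Mapping

# Section names follow the suggested shape in this issue.
SECTIONS = ("model", "mcp", "tools", "plugins", "run")


def effective_config(
    toml_data: Mapping[str, Any] | None,
    legacy_data: Mapping[str, Any],
) -> dict[str, tuple[str, Any]]:
    """Pick each section from TOML when present, else from legacy YAML.

    Returns {section: (source, value)} so a caller such as a config list
    command can report where every loaded section came from.
    """
    resolved: dict[str, tuple[str, Any]] = {}
    for section in SECTIONS:
        if toml_data is not None and section in toml_data:
            resolved[section] = ("toml", toml_data[section])
        elif section in legacy_data:
            resolved[section] = ("legacy", legacy_data[section])
    return resolved


# Hypothetical inputs: "run" exists in both sources, "tools" only in legacy.
result = effective_config(
    {"run": {"buffer_size": 1000}},
    {"run": {"buffer_size": 10}, "tools": {"configs": []}},
)
```

Recording the source next to each value keeps the precedence visible to users instead of leaving it implicit in load order.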

Impact on #559:

Implementation notes:

  • Add a root Pydantic model for the unified config file with explicit schema versioning.
  • Replace per-file repository write paths with section-aware repository access over one file, or introduce a shared config store that the existing repositories delegate to during migration.
  • Keep config concerns logically separated inside the TOML structure even though they share one file.
  • Decide on TOML parser/writer support deliberately: Data Designer supports Python >=3.10, but tomllib only entered the standard library in Python 3.11 and is read-only, so reading on 3.10 and writing on every supported version both need a third-party library.
  • Preserve comments and ordering if possible, since this file is meant to be edited by users.
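
A minimal sketch of explicit schema versioning and errors that carry file and section context is shown below. It uses plain Python checks as a stand-in for the root Pydantic model proposed above so that it stays dependency-free; ConfigError, validate_root, and SUPPORTED_VERSION are hypothetical names, and the checks cover only table types, not the real field schemas.

```python
from __future__ import annotations

from typing import Any

SUPPORTED_VERSION = 1  # matches `version = 1` in the suggested shape


class ConfigError(ValueError):
    """Validation failure that names the file and section involved."""

    def __init__(self, path: str, section: str, message: str) -> None:
        super().__init__(f"{path} [{section}]: {message}")
        self.path = path
        self.section = section


def validate_root(data: dict[str, Any], path: str) -> None:
    """Check the schema version and the basic type of each known section."""
    version = data.get("version")
    if type(version) is not int or version != SUPPORTED_VERSION:
        raise ConfigError(
            path, "version", f"expected {SUPPORTED_VERSION}, got {version!r}"
        )
    for section in ("model", "mcp", "tools", "plugins", "run"):
        if section in data and not isinstance(data[section], dict):
            raise ConfigError(path, section, "expected a table")
    run = data.get("run", {})
    if "throttle" in run and not isinstance(run["throttle"], dict):
        raise ConfigError(path, "run.throttle", "expected a table")
```

A Pydantic-based version would replace the manual checks with typed section models, but the error wrapper that adds the file path and section name would likely stay similar.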

Acceptance criteria:

  • A single TOML file can represent model providers, model configs, MCP providers, tool configs, plugin catalogs, and runtime defaults.
  • Existing YAML configs can still be loaded during a documented compatibility period.
  • A migration path combines existing YAML config files into the TOML file without losing settings.
  • CLI config commands read/write the TOML file as the canonical source.
  • The runtime-default design from #559 (Support system-level RunConfig defaults in ~/.data-designer) is either implemented directly in TOML or migrated cleanly from run_config.yaml.
  • Tests cover absent TOML, malformed TOML, section validation errors, legacy fallback, TOML-vs-legacy precedence, migration, and reset/list behavior.
  • Documentation explains the new file, the section schema, the migration path, and the relationship to dataset config files.
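
The migration criterion could be approached along these lines. This sketch assumes each legacy file has already been parsed into a list of entries (the actual YAML layouts are not described in this issue) and maps file names onto the tables from the suggested shape. The names migrate and LEGACY_TARGETS are hypothetical, and run_config.yaml from #559 is left out because its format is not defined here.

```python
from __future__ import annotations

from typing import Any

# Legacy file -> (table, key) in the unified document, per the suggested shape.
LEGACY_TARGETS = {
    "model_providers.yaml": ("model", "providers"),
    "model_configs.yaml": ("model", "configs"),
    "mcp_providers.yaml": ("mcp", "providers"),
    "tool_configs.yaml": ("tools", "configs"),
    "plugin_catalogs.yaml": ("plugins", "catalogs"),
}


def migrate(
    existing: dict[str, Any],
    legacy_payloads: dict[str, list[dict[str, Any]]],
) -> tuple[dict[str, Any], list[str]]:
    """Merge parsed legacy payloads into a copy of the unified document.

    Entries already present in the TOML data are kept (TOML wins); legacy
    files that collide are reported instead of being silently dropped.
    """
    merged = {
        key: (dict(value) if isinstance(value, dict) else value)
        for key, value in existing.items()
    }
    merged.setdefault("version", 1)
    skipped: list[str] = []
    for filename, entries in legacy_payloads.items():
        table, key = LEGACY_TARGETS[filename]
        section = merged.setdefault(table, {})
        if key in section:
            skipped.append(filename)  # caller decides how to resolve the conflict
        else:
            section[key] = list(entries)
    return merged, skipped
```

Returning the skipped file names lets a migrate command prompt the user about conflicts, which is one way to satisfy "without losing settings".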

Describe alternatives you've considered

  • Keep adding one YAML file per config concern.
    • This is simple locally, but the directory gets harder to understand and every new feature repeats the same repository/list/reset/migration work.
  • Consolidate into a single YAML file instead of TOML.
    • This would reduce file count, but TOML is a better fit for local application configuration: it is familiar from pyproject.toml, supports clear table sections, and is comfortable for hand editing.
  • Store runtime defaults in model_configs.yaml or model_providers.yaml.
    • This would reduce the number of files, but it mixes unrelated concerns and makes the schema harder to reason about.
  • Use environment variables for everything.
    • That works for a few scalar toggles, but it does not scale to nested provider/model/tool/runtime configuration.

Agent Investigation

  • Searched existing open and closed issues for TOML configuration, single config files, config consolidation, and ~/.data-designer consolidation. I did not find a direct duplicate.
  • DATA_DESIGNER_HOME is defined in packages/data-designer-config/src/data_designer/config/utils/constants.py and defaults to ~/.data-designer.
  • Current top-level config file constants include model_configs.yaml, model_providers.yaml, mcp_providers.yaml, and tool_configs.yaml.
  • The CLI plugin catalog repository currently persists plugin_catalogs.yaml separately.
  • #559 (Support system-level RunConfig defaults in ~/.data-designer) proposes run_config.yaml for persisted system-level RunConfig defaults, which this consolidation would affect.
  • The CLI currently has per-concern repositories for model configs, model providers, MCP providers, tool configs, and plugin catalogs. Those are natural migration points for a section-aware unified config store.
  • Existing config file loading supports .yaml, .yml, and .json for workflow configs; TOML would be a new format for user/system config unless broader support is intentionally added.

Additional context

This proposal is about the user/system configuration stored under DATA_DESIGNER_HOME. It should not collapse generated artifacts, managed asset caches, version-check caches, or normal dataset/workflow config files into the same TOML file.

The main goal is to make local Data Designer configuration easier to inspect, migrate, document, and extend before more user-level settings accumulate.

Checklist

  • I've reviewed existing issues and the documentation
  • This is a design proposal, not a "please build this" request

Labels: enhancement
