Skip to content

feat: agent CLI introspection (simplified)#415

Merged
johnnygreco merged 21 commits intomainfrom
johnny/feat/agent-cli-introspection
Mar 13, 2026
Merged

feat: agent CLI introspection (simplified)#415
johnnygreco merged 21 commits intomainfrom
johnny/feat/agent-cli-introspection

Conversation

@johnnygreco
Copy link
Copy Markdown
Contributor

@johnnygreco johnnygreco commented Mar 13, 2026

Summary

Adds a data-designer agent CLI subgroup that lets AI coding agents dynamically discover Data Designer's configuration types, builder API, and local state. All output is human-readable text.

Also adds schema_text() to ConfigBase so every config type can describe its own fields, and normalizes validator/constraint discriminators to kebab-case.

Agent commands

Command Description
agent context Bootstrap payload combining types, local state, and builder API — recommended starting point
agent types [FAMILY] Type names and import paths for one or all families (columns, samplers, validators, processors, constraints)
agent schema <FAMILY> <TYPE> Human-readable field summary and JSON schema for a single type
agent schema <FAMILY> --all Schemas for every type in a family
agent builder DataDesignerConfigBuilder method signatures and docstrings
agent state model-aliases Configured model aliases, providers, and usability status
agent state persona-datasets Persona locales, dataset sizes, and installation status

Design

  • Commands call plain functions in agent_introspection.py directly — no controller or service layer
  • Text formatting is separated into agent_text_formatter.py
  • schema_text() on ConfigBase (25 lines) replaces the 299-line pydantic_inspector from the previous approach — schema knowledge lives on the types themselves
  • All commands are lazy-loaded so --help stays fast (main.py has zero new imports)

Comparison with johnny/feat/agent-context

This branch delivers the same agent CLI functionality with significantly less complexity:

Metric agent-context This branch Reduction
Source files touched 22 9 59%
Source lines added 1,753 805 54%
Test lines added 2,374 635 73%
Total diff +4,127 / -130 +1,440 / -49 65%
Architectural layers 3 (command → controller → service) 1 (command → plain functions)

Key differences:

  • schema_text() on ConfigBase (25 lines) replaces pydantic_inspector.py (299 lines)
  • No controller or service layer — commands call plain functions directly
  • Human-readable text as the only output mode — simpler interface, no --json/--compact flags
  • Lazy loading preservedmain.py has zero new imports; --help path stays fast

@johnnygreco johnnygreco requested a review from a team as a code owner March 13, 2026 19:24
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 13, 2026

Greptile Summary

This PR introduces a data-designer agent CLI subgroup that lets AI coding agents dynamically discover Data Designer's configuration types, builder API, and local state. It also adds schema_text() to ConfigBase and enriches field definitions across column_configs.py with descriptions.

The implementation is clean and well-factored: a single flat command layer calls plain functions in agent_introspection.py, text formatting is separated into agent_text_formatter.py, and the lazy-loading architecture in main.py is preserved — agent_command_defs.py carries only dataclasses as a transitive dependency.

Key observations:

  • default_provider inconsistency in get_model_aliases_state: The early-return path (when no models.yaml is present) reports default_provider as provider_registry.default verbatim, while the main path falls back to providers[0].name when the explicit default is unset. An agent querying state before any model configs exist would see default_provider: null despite a provider being configured, which could produce incorrect diagnostic output.
  • _get_builder_methods includes __init__: The guard name != "__init__" is intentional but will silently include object.__init__ if DataDesignerConfigBuilder does not define its own __init__, adding a low-value entry to the builder surface output.
  • All previously-flagged issues (builder section in context output, PydanticUndefined for default_factory fields, None rendered as "None" in tables, empty-providers IndexError, test file placement) have been addressed in subsequent commits.

Confidence Score: 4/5

  • Safe to merge with a minor fix — the default_provider inconsistency affects agent-visible output but not core functionality or data integrity.
  • The architecture is sound, all previously-reviewed issues have been addressed, test coverage is thorough (204 schema tests + 104 introspection unit tests + 140 integration tests + 223 formatter tests), and the lazy-loading constraint is preserved. The one logic bug (early-return default_provider inconsistency in get_model_aliases_state) produces misleading output for the exact agent-facing scenario this PR is designed to serve — worth fixing before merge, but it is scoped to a single code path and has no impact on production data generation.
  • packages/data-designer/src/data_designer/cli/utils/agent_introspection.py — the get_model_aliases_state early-return block at line 185.

Important Files Changed

Filename Overview
packages/data-designer-config/src/data_designer/config/base.py Adds schema_text() classmethod to ConfigBase with correct default_factory handling, enum coercion, and a comprehensive Google-style section header set for docstring extraction. The _format_annotation and _get_docstring_summary helpers are clean and well-tested.
packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Core introspection module. The get_model_aliases_state early-return path (model_registry is None) reports default_provider using provider_registry.default directly, while the main path falls back to providers[0].name — causing an inconsistency that can mislead agents about provider availability. Everything else (family discovery, schema building, builder API introspection, registry loading) is correct and well-guarded.
packages/data-designer/src/data_designer/cli/utils/agent_text_formatter.py All previously-flagged issues (builder section, None rendering) are addressed. The _cell() helper correctly coerces None to empty string in both width calculation and cell rendering. All six output formatters are present and structurally correct.
packages/data-designer/src/data_designer/cli/main.py Adds agent and agent state Typer subgroups using the existing create_lazy_typer_group mechanism. _build_agent_lazy_group correctly partitions top-level vs. dotted (state.*) commands. The new import of AGENT_COMMANDS from agent_command_defs is cheap (stdlib-only transitive deps) and does not materially affect --help startup time.
packages/data-designer/src/data_designer/cli/commands/agent.py Thin command layer; all six commands delegate to plain functions via the _run helper, which correctly differentiates AgentIntrospectionError (structured error code) from generic exceptions (internal_error). Clean and consistent.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant CLI as data-designer agent
    participant Cmd as commands/agent.py
    participant Intro as agent_introspection.py
    participant Fmt as agent_text_formatter.py

    Agent->>CLI: agent context
    CLI->>Cmd: context_command()
    Cmd->>Intro: get_context(DATA_DESIGNER_HOME)
    Intro->>Intro: get_family_catalog() × 5 families
    Intro->>Intro: get_model_aliases_state()
    Intro->>Intro: get_persona_datasets_state()
    Intro->>Intro: get_builder_api()
    Intro-->>Cmd: dict payload
    Cmd->>Fmt: format_context_text(data)
    Fmt-->>Cmd: formatted text
    Cmd-->>Agent: stdout (human-readable)

    Agent->>CLI: agent schema columns llm-text
    CLI->>Cmd: schema_command(family, type_name)
    Cmd->>Intro: get_schema("columns", "llm-text")
    Intro->>Intro: discover_family_types()
    Intro->>Intro: _build_schema_dict() → cls.schema_text()
    Intro-->>Cmd: schema dict
    Cmd->>Fmt: format_schema_text(data)
    Fmt-->>Agent: stdout (schema_text output)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer/src/data_designer/cli/utils/agent_introspection.py
Line: 185-191

Comment:
**`default_provider` logic inconsistent with main path**

The early-return path (when `model_registry is None`) reports `default_provider` as `provider_registry.default` directly. If a user has a `providers.yaml` with providers but no explicit `default` set, `provider_registry.default` is `None` — so the output shows `default_provider: null`. 

In contrast, the main code path (lines 198–200) falls back to `provider_registry.providers[0].name` when `provider_registry.default` is falsy:

```python
default_provider = provider_registry.default or (
    provider_registry.providers[0].name if provider_registry.providers else None
)
```

An agent running `agent state model-aliases` before any models are configured (e.g., after `data-designer config providers` but before `data-designer config models`) would see `default_provider: null` and could incorrectly diagnose the setup as having no provider, when one is actually available. The fix is to apply the same fallback in the early return:

```suggestion
    if model_registry is None:
        if provider_registry is not None:
            computed_default = provider_registry.default or (
                provider_registry.providers[0].name if provider_registry.providers else None
            )
        else:
            computed_default = None
        return {
            "model_config_present": False,
            "provider_config_present": provider_registry is not None,
            "default_provider": computed_default,
            "items": items,
        }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/data-designer/src/data_designer/cli/utils/agent_introspection.py
Line: 282-285

Comment:
**`__init__` included but condition is unintuitive**

The filter `name.startswith("_") and name != "__init__"` explicitly retains `__init__` while excluding all other private/dunder methods. If `DataDesignerConfigBuilder.__init__` has parameters and a docstring, this is intentional. However, if `__init__` is inherited from a base class or is just the default `object.__init__`, the method signature `__init__() -> None` would appear in the builder output but provide no useful information to agents.

Consider only including `__init__` when it is actually defined on `DataDesignerConfigBuilder` itself (not inherited), to avoid a spurious method entry from `object.__init__`:

```suggestion
    for name, attr in inspect.getmembers(DataDesignerConfigBuilder):
        if name.startswith("_") and not (name == "__init__" and "__init__" in DataDesignerConfigBuilder.__dict__):
            continue
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 0bb974b

Comment thread packages/data-designer-config/src/data_designer/config/base.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
@johnnygreco johnnygreco force-pushed the johnny/feat/agent-cli-introspection branch from a6214a9 to 5283f97 Compare March 13, 2026 19:48
Comment thread packages/data-designer/tests/cli/commands/test_agent_command.py
nabinchha
nabinchha previously approved these changes Mar 13, 2026
Copy link
Copy Markdown
Contributor

@nabinchha nabinchha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary: Adds a data-designer agent CLI subgroup that lets AI coding agents dynamically discover Data Designer's configuration types, builder API, and local state via human-readable text output. Also adds schema_text() to ConfigBase and normalizes validator/constraint discriminators to concrete Literal fields. The implementation matches the stated intent well — the simplified single-layer architecture (commands call plain functions directly) is a significant improvement over the prior 3-layer approach.

Findings

Critical — Must fix before merge

No critical issues found.

Warnings — Strongly recommend fixing

packages/data-designer/src/data_designer/cli/utils/agent_text_formatter.py:119,128or '' coercion suppresses falsy values like 0 and False

  • What: row.get(col) or '' coerces None to empty string (good), but also coerces 0, False, and empty string "" to empty string. The usable column in model aliases contains boolean True/False values. When usable is False, row.get("usable") or '' evaluates to '', so the table would show an empty cell instead of False.
  • Why: An agent reading the model-aliases table would see an empty usable column for unusable models, losing a key diagnostic signal.
  • Suggestion: Use an explicit None check:
def _cell_value(val: Any) -> str:
    return "" if val is None else str(val)

Then use _cell_value(row.get(col)) in both the width calculation and row rendering.

packages/data-designer/src/data_designer/cli/main.py:110 — Implicit and/or precedence in filter expression

  • What: The condition if prefix == "" and "." not in cmd.name or cmd.name.startswith(f"{prefix}.") relies on and binding tighter than or. While logically correct, it reads as ambiguous.
  • Why: A future maintainer adding a new command pattern could easily misread the intent and introduce a routing bug.
  • Suggestion: Add explicit parentheses:
if (prefix == "" and "." not in cmd.name) or (prefix != "" and cmd.name.startswith(f"{prefix}."))

packages/data-designer-config/src/data_designer/config/sampler_constraints.py:27-29Constraint base class is no longer abstract but not meant to be instantiated

  • What: The base Constraint class was previously ABC but is now a plain ConfigBase subclass. It declares constraint_type: ConstraintType as required (no default), so direct instantiation would fail with a confusing validation error.
  • Why: If someone instantiates Constraint directly, the error won't clearly indicate they should use a subclass.
  • Suggestion: Restore ABC on Constraint — it was abstract before and still shouldn't be instantiated directly.

packages/data-designer/tests/cli/commands/test_agent_command.py:57-65test_error_outputs_message_to_stderr may fail depending on Typer version

  • What: The test accesses result.stdout and result.stderr separately, but Typer's CliRunner behavior around stderr separation depends on the Typer/Click version pinned.
  • Why: A test that errors instead of testing the intended behavior provides false confidence.
  • Suggestion: Confirm this test passes in CI. If it doesn't, use CliRunner(mix_stderr=False) explicitly.

Suggestions — Consider improving

packages/data-designer-config/tests/config/test_schema_text.py:149 — Single-letter loop variable l

  • What: next(i for i, l in enumerate(lines) if "name: str" in l) uses l which is visually similar to 1.
  • Suggestion: Rename to line.

packages/data-designer/tests/cli/utils/test_agent_text_formatter.py:219 — Import inside test function

  • What: test_format_schema_text_on_real_config_models imports get_family_schema inside the function body rather than at module level.
  • Suggestion: Move the import to the top of the file per AGENTS.md guidelines.

What Looks Good

  • schema_text() on ConfigBase is an elegant 25-line solution replacing 299 lines of pydantic_inspector. Putting schema knowledge on the types themselves is the right design — discoverable, testable, and naturally stays in sync as fields change.

  • Clean data/presentation separation: agent_introspection.py returns plain dicts, agent_text_formatter.py renders them. Both are independently testable and leave the door open for adding JSON output later with minimal effort.

  • Thorough edge-case testing for schema_text(): The 15 tests in test_schema_text.py cover default_factory, enum defaults, None defaults, scalar defaults, descriptions, docstrings, and the mixed-model integration case. The PydanticUndefined regression test is particularly valuable.

Verdict

Ship it (with nits) — The core design is sound, the implementation is clean, and the test coverage is good. The or '' falsy-value suppression is the most impactful issue to fix before merge since it silently drops False values from the model-aliases table. The other warnings are lower priority but worth addressing. Linter passes clean on all changed files.

@johnnygreco
Copy link
Copy Markdown
Contributor Author

Thanks for the review @nabinchha! All items addressed:

Warnings — fixed:

  • or '' suppressing False — Fixed in 80076c4. _cell() helper now uses an explicit if value is None check so False and 0 render correctly.
  • Operator precedence in main.py:110 — Fixed in 6d85877. Added explicit parentheses.
  • Restore ABC on Constraint — Fixed in 99bab37. ABC restored to match main. (Note: Pydantic overrides __init__ so ABC alone doesn't prevent instantiation, but it signals intent.)
  • CliRunner stderr test — Confirmed passing in CI. Typer's CliRunner captures stderr separately by default (no mix_stderr parameter), unlike Click's.

Suggestions — fixed:

  • Loop variable l — Renamed to line in 6d85877.
  • Import inside test function — Moved get_family_schema to module level in 6d85877.

ConfigBase.schema_text() returns a concise text representation including
the class docstring summary, field names, types, defaults, and
descriptions. Field descriptions added to column config types to
surface through this method.
Delete AgentController class and agent_command_defs module. Move all
logic into agent_introspection (data) and agent_text_formatter (display)
as plain functions. Add --json flag so commands default to human-readable
text using schema_text(), with JSON as opt-in. Unify _emit helper,
remove include_docstrings parameter, deduplicate catalog calls, and fix
N+1 discover_family_types in get_family_schemas.
Port test_agent_controller.py to use plain functions instead of deleted
AgentController. Extract AGENT_COMMANDS constant as single source for
operation descriptions, syncing with main.py help strings.
Extract AGENT_COMMANDS into agent_command_defs.py so main.py and
agent_introspection.py share a single source for command names,
help text, and metadata. The new module has no heavy dependencies,
keeping --help latency unaffected.
…trospection

- schema_text() now detects default_factory fields and renders e.g. "list()"
  instead of leaking PydanticUndefined
- Guard against IndexError when provider registry has an empty providers list
- Add 15 edge-case tests for schema_text covering default_factory, enum
  defaults, None defaults, scalar defaults, descriptions, and docstrings
Text-only output simplifies the interface. Structured output can be
added back trivially since the functions already return dicts.
- format_context_text now renders a ## Builder section
- format_types_text now includes import_path column in tables
All config objects are imported via dd.<ClassName>, so the full import
path is redundant noise in agent output.
row.get(col, '') returns None when the key exists with value None,
causing str(None) to render "None" in the output. Use `or ''` instead.
…ration tests

There is no controller layer — these tests exercise functions in
agent_introspection.py, so they belong in tests/cli/utils/.
The previous `or ''` pattern treated all falsy values (including False)
as empty. Use an explicit None check so booleans render correctly.
- Add explicit parentheses to and/or precedence in _build_agent_lazy_group
- Rename loop variable l to line in test_schema_text
- Move get_family_schema import to module level in test_agent_text_formatter
@johnnygreco johnnygreco force-pushed the johnny/feat/agent-cli-introspection branch from 99bab37 to 5926c7e Compare March 13, 2026 21:55
nabinchha
nabinchha previously approved these changes Mar 13, 2026
Comment thread packages/data-designer-config/src/data_designer/config/base.py Outdated
…nd docstring parsing

- _format_annotation now renders Literal['value'] instead of bare Literal
- _format_signature strips quotes from stringified annotations caused by
  `from __future__ import annotations`
- _get_docstring_summary stops at any Google-style section header, not
  just Attributes:
@johnnygreco johnnygreco merged commit 4c19dba into main Mar 13, 2026
47 checks passed
Comment on lines +185 to +191
if model_registry is None:
return {
"model_config_present": False,
"provider_config_present": provider_registry is not None,
"default_provider": None if provider_registry is None else provider_registry.default,
"items": items,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default_provider logic inconsistent with main path

The early-return path (when model_registry is None) reports default_provider as provider_registry.default directly. If a user has a providers.yaml with providers but no explicit default set, provider_registry.default is None — so the output shows default_provider: null.

In contrast, the main code path (lines 198–200) falls back to provider_registry.providers[0].name when provider_registry.default is falsy:

default_provider = provider_registry.default or (
    provider_registry.providers[0].name if provider_registry.providers else None
)

An agent running agent state model-aliases before any models are configured (e.g., after data-designer config providers but before data-designer config models) would see default_provider: null and could incorrectly diagnose the setup as having no provider, when one is actually available. The fix is to apply the same fallback in the early return:

Suggested change
if model_registry is None:
return {
"model_config_present": False,
"provider_config_present": provider_registry is not None,
"default_provider": None if provider_registry is None else provider_registry.default,
"items": items,
}
if model_registry is None:
if provider_registry is not None:
computed_default = provider_registry.default or (
provider_registry.providers[0].name if provider_registry.providers else None
)
else:
computed_default = None
return {
"model_config_present": False,
"provider_config_present": provider_registry is not None,
"default_provider": computed_default,
"items": items,
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/data-designer/src/data_designer/cli/utils/agent_introspection.py
Line: 185-191

Comment:
**`default_provider` logic inconsistent with main path**

The early-return path (when `model_registry is None`) reports `default_provider` as `provider_registry.default` directly. If a user has a `providers.yaml` with providers but no explicit `default` set, `provider_registry.default` is `None` — so the output shows `default_provider: null`. 

In contrast, the main code path (lines 198–200) falls back to `provider_registry.providers[0].name` when `provider_registry.default` is falsy:

```python
default_provider = provider_registry.default or (
    provider_registry.providers[0].name if provider_registry.providers else None
)
```

An agent running `agent state model-aliases` before any models are configured (e.g., after `data-designer config providers` but before `data-designer config models`) would see `default_provider: null` and could incorrectly diagnose the setup as having no provider, when one is actually available. The fix is to apply the same fallback in the early return:

```suggestion
    if model_registry is None:
        if provider_registry is not None:
            computed_default = provider_registry.default or (
                provider_registry.providers[0].name if provider_registry.providers else None
            )
        else:
            computed_default = None
        return {
            "model_config_present": False,
            "provider_config_present": provider_registry is not None,
            "default_provider": computed_default,
            "items": items,
        }
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants