From 86a610124c0bc044b898be0e25f08a032e2562d0 Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Mon, 23 Mar 2026 17:22:59 -0600 Subject: [PATCH 1/7] docs: restructure agent and contributor documentation Restructure AGENTS.md from ~627 lines to ~55 lines of high-signal architectural invariants. Extract code style into STYLEGUIDE.md and development workflow into DEVELOPMENT.md. Overhaul CONTRIBUTING.md to reflect agent-assisted development as the primary workflow. Move skills and sub-agents from .claude/ to .agents/ as the tool-agnostic home, with symlinks back for Claude Code compatibility. Add architecture/ skeleton with 10 stub files for incremental population. Implements PR 1 of #427. Made-with: Cursor --- .agents/README.md | 23 + {.claude => .agents}/agents/docs-searcher.md | 0 .../agents/github-searcher.md | 0 {.claude => .agents}/skills/commit/SKILL.md | 0 .../skills/create-pr/SKILL.md | 0 {.claude => .agents}/skills/new-sdg/SKILL.md | 0 .../skills/review-code/SKILL.md | 17 +- .../skills/search-docs/SKILL.md | 0 .../skills/search-github/SKILL.md | 0 .../skills/update-pr/SKILL.md | 0 .claude/agents | 1 + .claude/skills | 1 + AGENTS.md | 645 +----------------- CONTRIBUTING.md | 259 ++----- DEVELOPMENT.md | 195 ++++++ README.md | 4 +- STYLEGUIDE.md | 361 ++++++++++ architecture/agent-introspection.md | 21 + architecture/cli.md | 20 + architecture/config.md | 21 + architecture/dataset-builders.md | 21 + architecture/engine.md | 22 + architecture/mcp.md | 21 + architecture/models.md | 20 + architecture/overview.md | 22 + architecture/plugins.md | 21 + architecture/sampling.md | 21 + 27 files changed, 901 insertions(+), 815 deletions(-) create mode 100644 .agents/README.md rename {.claude => .agents}/agents/docs-searcher.md (100%) rename {.claude => .agents}/agents/github-searcher.md (100%) rename {.claude => .agents}/skills/commit/SKILL.md (100%) rename {.claude => .agents}/skills/create-pr/SKILL.md (100%) rename {.claude => .agents}/skills/new-sdg/SKILL.md 
(100%) rename {.claude => .agents}/skills/review-code/SKILL.md (94%) rename {.claude => .agents}/skills/search-docs/SKILL.md (100%) rename {.claude => .agents}/skills/search-github/SKILL.md (100%) rename {.claude => .agents}/skills/update-pr/SKILL.md (100%) create mode 120000 .claude/agents create mode 120000 .claude/skills create mode 100644 DEVELOPMENT.md create mode 100644 STYLEGUIDE.md create mode 100644 architecture/agent-introspection.md create mode 100644 architecture/cli.md create mode 100644 architecture/config.md create mode 100644 architecture/dataset-builders.md create mode 100644 architecture/engine.md create mode 100644 architecture/mcp.md create mode 100644 architecture/models.md create mode 100644 architecture/overview.md create mode 100644 architecture/plugins.md create mode 100644 architecture/sampling.md diff --git a/.agents/README.md b/.agents/README.md new file mode 100644 index 000000000..4efd1e332 --- /dev/null +++ b/.agents/README.md @@ -0,0 +1,23 @@ +# .agents/ + +This is the tool-agnostic home for shared agent infrastructure used in **developing** DataDesigner. + +## Structure + +``` +.agents/ +├── skills/ # Development skills (commit, create-pr, review-code, etc.) +├── agents/ # Sub-agent persona definitions (docs-searcher, github-searcher) +└── README.md # This file +``` + +## Compatibility + +Tool-specific directories symlink back here so each harness resolves skills from the same source: + +- `.claude/skills` → `.agents/skills` +- `.claude/agents` → `.agents/agents` + +## Scope + +All skills and agents in this directory are for **contributors developing DataDesigner** — not for end users building datasets. If you're looking for usage tooling, see the [product documentation](https://nvidia-nemo.github.io/DataDesigner/). 
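The symlink compatibility layout described above can be sanity-checked with a short `pathlib` sketch. This builds the structure in a temporary directory as a stand-in for the repository root (the real repo ships these symlinks; the relative link targets below match the `120000` symlink entries in this patch):

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Stand-in for the repository root; the actual repo already contains these links.
root = Path(tempfile.mkdtemp())
(root / ".agents" / "skills").mkdir(parents=True)
(root / ".agents" / "agents").mkdir(parents=True)
(root / ".claude").mkdir()

# Relative targets, as in the patch: .claude/skills -> ../.agents/skills
(root / ".claude" / "skills").symlink_to("../.agents/skills")
(root / ".claude" / "agents").symlink_to("../.agents/agents")

# Both harnesses now resolve skills from the same source directory.
same_source = (root / ".claude" / "skills").resolve() == (root / ".agents" / "skills").resolve()
print(same_source)
```

Because the link targets are relative (`../.agents/...` rather than absolute paths), the symlinks survive cloning the repository to any location.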
diff --git a/.claude/agents/docs-searcher.md b/.agents/agents/docs-searcher.md similarity index 100% rename from .claude/agents/docs-searcher.md rename to .agents/agents/docs-searcher.md diff --git a/.claude/agents/github-searcher.md b/.agents/agents/github-searcher.md similarity index 100% rename from .claude/agents/github-searcher.md rename to .agents/agents/github-searcher.md diff --git a/.claude/skills/commit/SKILL.md b/.agents/skills/commit/SKILL.md similarity index 100% rename from .claude/skills/commit/SKILL.md rename to .agents/skills/commit/SKILL.md diff --git a/.claude/skills/create-pr/SKILL.md b/.agents/skills/create-pr/SKILL.md similarity index 100% rename from .claude/skills/create-pr/SKILL.md rename to .agents/skills/create-pr/SKILL.md diff --git a/.claude/skills/new-sdg/SKILL.md b/.agents/skills/new-sdg/SKILL.md similarity index 100% rename from .claude/skills/new-sdg/SKILL.md rename to .agents/skills/new-sdg/SKILL.md diff --git a/.claude/skills/review-code/SKILL.md b/.agents/skills/review-code/SKILL.md similarity index 94% rename from .claude/skills/review-code/SKILL.md rename to .agents/skills/review-code/SKILL.md index 922afb0bf..910664269 100644 --- a/.claude/skills/review-code/SKILL.md +++ b/.agents/skills/review-code/SKILL.md @@ -88,16 +88,13 @@ Where `` is `main` unless overridden in arguments. ## Step 2: Load Project Guidelines -Read `AGENTS.md` at the repository root to load the project's coding standards, design principles, and conventions. 
This is the authoritative source for: +Read the following files at the repository root to load the project's standards and conventions: -- Code style rules (formatting, naming, imports, type annotations) -- Design principles (DRY, KISS, YAGNI, SOLID) -- Testing patterns and expectations -- Architecture and layering conventions -- Common pitfalls to watch for -- Lazy loading and `TYPE_CHECKING` patterns +- **`AGENTS.md`** — architecture, layering, core design principles, structural invariants +- **`STYLEGUIDE.md`** — code style rules (formatting, naming, imports, type annotations), design principles (DRY, KISS, YAGNI, SOLID), common pitfalls, lazy loading and `TYPE_CHECKING` patterns +- **`DEVELOPMENT.md`** — testing patterns and expectations -Use these guidelines as the baseline for the entire review. Any project-specific rules in `AGENTS.md` take precedence over general best practices. +Use these guidelines as the baseline for the entire review. Project-specific rules take precedence over general best practices. ## Step 3: Understand the Scope @@ -163,9 +160,9 @@ Final pass focused on **project conventions and test quality for new/modified co - Are mocks/stubs used appropriately (at boundaries, not deep internals)? - Do new test names clearly describe what they verify? -**Project Standards (from AGENTS.md) — apply to new/modified code only:** +**Project Standards (from AGENTS.md and STYLEGUIDE.md) — apply to new/modified code only:** -Verify the items below on lines introduced or changed by this branch. Refer to the full `AGENTS.md` loaded in Step 2 for details and examples. +Verify the items below on lines introduced or changed by this branch. Refer to `AGENTS.md` and `STYLEGUIDE.md` loaded in Step 2 for details and examples. 
- License headers: if present, they should be correct (wrong year or format → suggest `make update-license-headers`; don't treat as critical if CI enforces this) - `from __future__ import annotations` in new files diff --git a/.claude/skills/search-docs/SKILL.md b/.agents/skills/search-docs/SKILL.md similarity index 100% rename from .claude/skills/search-docs/SKILL.md rename to .agents/skills/search-docs/SKILL.md diff --git a/.claude/skills/search-github/SKILL.md b/.agents/skills/search-github/SKILL.md similarity index 100% rename from .claude/skills/search-github/SKILL.md rename to .agents/skills/search-github/SKILL.md diff --git a/.claude/skills/update-pr/SKILL.md b/.agents/skills/update-pr/SKILL.md similarity index 100% rename from .claude/skills/update-pr/SKILL.md rename to .agents/skills/update-pr/SKILL.md diff --git a/.claude/agents b/.claude/agents new file mode 120000 index 000000000..4c8a5fc93 --- /dev/null +++ b/.claude/agents @@ -0,0 +1 @@ +../.agents/agents \ No newline at end of file diff --git a/.claude/skills b/.claude/skills new file mode 120000 index 000000000..2b7a412b8 --- /dev/null +++ b/.claude/skills @@ -0,0 +1 @@ +../.agents/skills \ No newline at end of file diff --git a/AGENTS.md b/AGENTS.md index 7d46ef76f..5d0152a65 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,626 +1,55 @@ # AGENTS.md -This file provides guidance to agents when working with code in this repository. +This file is for agents **developing** DataDesigner — the codebase you are working in. +If you are an agent helping a user **build a dataset**, see the [product documentation](https://nvidia-nemo.github.io/DataDesigner/) and tutorials instead. -## Project Overview +**DataDesigner** is an NVIDIA NeMo framework for creating synthetic datasets from scratch. Users declare what their data should look like (columns, types, relationships, validation rules); the engine figures out how to generate it. Every change you make should preserve this "declare, don't orchestrate" contract. 
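In the builder API's own terms, the contract looks like the snippet below (adapted from the recommended-import-pattern example in the pre-restructure AGENTS.md that this patch removes; illustrative, not exhaustive):

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Declare what the data should look like; the engine derives the generation plan.
config_builder = dd.DataDesignerConfigBuilder()
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)
data_designer = DataDesigner()
```

Note that the user never sequences generation steps; the column configs are pure data, and ordering falls out of the engine's DAG over column dependencies.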
-**DataDesigner** is an NVIDIA NeMo project for creating synthetic datasets from scratch. It's a comprehensive framework that generates structured data using multiple generation strategies: +## The Layering Is Structural -- **Sampled data**: Built-in generators (UUID, DateTime, etc.) and Faker integration -- **LLM-generated content**: Text, code, and structured data via LiteLLM -- **Expression-based columns**: Derived columns using Jinja2 templates -- **Validation & scoring**: Python, SQL, and remote validators; LLM-based judge scoring -- **Seed dataset-based generation**: Generate from existing datasets +The `data_designer` namespace is split across three installable packages that merge at runtime via PEP 420 implicit namespace packages (no top-level `__init__.py`). -### Architecture +| Package | Path | Owns | +|---------|------|------| +| `data-designer-config` | `packages/data-designer-config/` | `data_designer.config` — column configs, model configs, sampler params, builder API, plugin system, lazy imports | +| `data-designer-engine` | `packages/data-designer-engine/` | `data_designer.engine` — column generators, dataset builders, DAG execution, model facade, validators, sampling | +| `data-designer` | `packages/data-designer/` | `data_designer.interface` — public `DataDesigner` class, results, errors; `data_designer.cli` — CLI entry point; `data_designer.integrations` | -The project follows a layered architecture: +**Dependency direction:** config ← engine ← interface. Never import against this flow. -1. 
**Config Layer** ([packages/data-designer-config/src/data_designer/config/](packages/data-designer-config/src/data_designer/config/)): User-facing configuration API - - `config_builder.py`: Main builder API for constructing configurations - - `column_configs.py`: Column configuration types (Sampler, LLMText, LLMCode, LLMStructured, LLMJudge, Expression, Validation, SeedDataset) - - `models.py`: Model configurations and inference parameters - - `sampler_params.py`: Parametrized samplers (Uniform, Category, Person, DateTime, etc.) +## Core Concepts -2. **Engine Layer** ([packages/data-designer-engine/src/data_designer/engine/](packages/data-designer-engine/src/data_designer/engine/)): Internal generation and processing - - `column_generators/`: Generates individual columns from configs - - `dataset_builders/`: Orchestrates full dataset generation with DAG-based dependency management - - `models/`: LLM integration via LiteLLM with response parsing - - `validators/`: Column validation (Python, SQL, Code, Remote) - - `sampling_gen/`: Sophisticated person/entity sampling +- **Column** — a named field in the output dataset, defined by a column config +- **Sampler** — a built-in statistical generator (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.) +- **Seed dataset** — an existing dataset used as input for generation +- **Processor** — a post-generation transformation applied to column values +- **Model** — an LLM endpoint configured via `ModelConfig` and accessed through the model facade +- **Plugin** — a user-supplied extension registered via entry points (custom column generators, validators, profilers) -3. **Interface Layer** ([packages/data-designer/src/data_designer/interface/](packages/data-designer/src/data_designer/interface/)): Public API - - `data_designer.py`: Main `DataDesigner` class (primary entry point) - - `results.py`: Result containers - - `errors.py`: Public error types +## Core Design Principles -### Recommended Import Pattern +1. 
**Declarative config, imperative engine.** Users build configs; the engine compiles them into an execution plan. Config objects are data; they never call the engine directly. +2. **Registries connect types to behavior.** Column generators, validators, and profilers are discovered through registries. Adding a new type means registering it, not modifying orchestration code. +3. **Errors normalize at boundaries.** Third-party exceptions are wrapped into canonical project error types at module boundaries. Callers depend on `data_designer` errors, not leaked internals. -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner +## Structural Invariants -# Usage: -data_designer = DataDesigner() -config_builder = dd.DataDesignerConfigBuilder() -config_builder.add_column( - dd.SamplerColumnConfig( - name="category", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["A", "B"]), - ) -) -``` - -### Key Design Patterns - -- **Builder pattern**: Configuration construction via `DataDesignerConfigBuilder` -- **Registry pattern**: Plugin system for column generators, validators, and profilers -- **Strategy pattern**: Multiple generation approaches (sampled, LLM, expression, seed) -- **DAG-based execution**: Column dependencies managed as directed acyclic graph +- **Import direction** — config ← engine ← interface. No reverse imports. +- **Fast imports** — heavy third-party libraries are lazy-loaded via `data_designer.lazy_heavy_imports`. See [STYLEGUIDE.md](STYLEGUIDE.md) for the pattern. +- **No relative imports** — absolute imports only, enforced by ruff rule `TID`. +- **Typed code** — all functions, methods, and class attributes require type annotations. Modern syntax: `list[str]`, `str | None`. +- **`from __future__ import annotations`** — required in every Python source file. +- **Follow established patterns** — match the conventions of the module you're editing. When in doubt, read the neighboring code. 
+- **No untested code paths** — new logic requires tests. See [DEVELOPMENT.md](DEVELOPMENT.md) for testing guidance. -## Development Workflow - -This project uses `uv` for dependency management and `make` for common tasks: +## Development ```bash -# Install dependencies -uv sync - -# Install with dev dependencies -uv sync --all-extras - -# Run the main module (if applicable) -uv run python -m data_designer -``` - -### Code Quality - -```bash -# Using Make (recommended) -make lint # Run ruff linter -make lint-fix # Fix linting issues automatically -make format # Format code with ruff -make format-check # Check code formatting without changes -make check-all # Run all checks (format-check + lint) -make check-all-fix # Run all checks with autofix (format + lint-fix) - -# Direct commands -uv run ruff check # Lint all files -uv run ruff check --fix # Lint with autofix -uv run ruff format # Format all files -uv run ruff format --check # Check formatting -``` - -### Running Tests - -```bash -# Run all tests -uv run pytest - -# Run tests with verbose output -uv run pytest -v - -# Run a specific test file -uv run pytest tests/config/test_sampler_constraints.py - -# Run tests with coverage -uv run pytest --cov=data_designer --cov-report=term-missing --cov-report=html - -# Using Make -make test # Run all tests -make coverage # Run tests with coverage report -``` - -## Key Files - -- [packages/data-designer/src/data_designer/interface/data_designer.py](packages/data-designer/src/data_designer/interface/data_designer.py) - Main entry point (`DataDesigner` class) -- [packages/data-designer-config/src/data_designer/config/config_builder.py](packages/data-designer-config/src/data_designer/config/config_builder.py) - Configuration API (`DataDesignerConfigBuilder`) -- [packages/data-designer-config/src/data_designer/config/__init__.py](packages/data-designer-config/src/data_designer/config/__init__.py) - User-facing config API exports -- 
[packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py](packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py) - Generation orchestrator -- [pyproject.toml](pyproject.toml) - Project dependencies and tool configurations -- [Makefile](Makefile) - Common development commands - -## Working Guidelines - -- **Comments**: Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments. -- **License headers**: All Python files must include the NVIDIA SPDX license header: - ```python - # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - # SPDX-License-Identifier: Apache-2.0 - ``` - Use `make update-license-headers` to add headers to all files automatically. -- **`from __future__ import annotations`**: Include at the top of all Python source files for deferred type evaluation. -- **Imports**: Avoid importing Python modules inside method definitions. Prefer module-level imports for better performance and clarity. See [Code Style](#code-style) for detailed import and type annotation guidelines. - -## Code Style - -This project uses `ruff` (>=0.14.10) for linting and formatting. Follow these guidelines to avoid linter errors: - -### General Formatting - -- **Line length**: Maximum 120 characters per line -- **Quote style**: Always use double quotes (`"`) for strings -- **Indentation**: Use 4 spaces (never tabs) -- **Target version**: Python 3.10+ - -### Type Annotations - -Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules `UP006`, `UP007`, and `UP045`. 
- -- **ALWAYS** add type annotations to all functions, methods, and class attributes (including tests) -- Use primitive types when possible: `list` not `List`, `dict` not `Dict`, `set` not `Set`, `tuple` not `Tuple` (enforced by `UP006`) -- Use modern union syntax with `|` for optional and union types: - - `str | None` not `Optional[str]` (enforced by `UP045`) - - `int | str` not `Union[int, str]` (enforced by `UP007`) -- Only import from `typing` when absolutely necessary for complex generic types -- For Pydantic models, use field-level type annotations - - ```python - # Good - def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]: - return {item: len(item) for item in items} - - # Avoid - missing type annotations - def process_items(items, max_count=None): - return {item: len(item) for item in items} - - # Avoid - old-style typing - from typing import List, Dict, Optional - def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]: - return {item: len(item) for item in items} - ``` - -### Import Style - -- **ALWAYS** include `from __future__ import annotations` at the top of every Python source file (after the license header) for deferred type evaluation -- **ALWAYS** use absolute imports, never relative imports (enforced by `TID`) -- Place imports at module level, not inside functions (exception: it is unavoidable for performance reasons) -- Import sorting is handled by `ruff`'s `isort` - imports should be grouped and sorted: - 1. Standard library imports - 2. Third-party imports (use `lazy_heavy_imports` for heavy libraries) - 3. 
First-party imports (`data_designer`) -- Use standard import conventions (enforced by `ICN`) -- See [Lazy Loading and TYPE_CHECKING](#lazy-loading-and-type_checking) section for optimization guidelines - - ```python - # Good - from data_designer.config.config_builder import DataDesignerConfigBuilder - - # Bad - relative import (will cause linter errors) - from .config_builder import DataDesignerConfigBuilder - - # Good - imports at module level - from pathlib import Path - - def process_file(filename: str) -> None: - path = Path(filename) - - # Bad - import inside function - def process_file(filename: str) -> None: - from pathlib import Path - path = Path(filename) - ``` - -### Lazy Loading and TYPE_CHECKING - -This project uses lazy loading for heavy third-party dependencies to optimize import performance. - -#### When to Use Lazy Loading - -**Heavy third-party libraries** (>100ms import cost) should be lazy-loaded via `lazy_heavy_imports.py`: - -```python -# ❌ Don't import directly -import pandas as pd -import numpy as np - -# ✅ Use lazy loading with IDE support -from typing import TYPE_CHECKING -from data_designer.lazy_heavy_imports import pd, np - -if TYPE_CHECKING: - import pandas as pd # For IDE autocomplete and type hints - import numpy as np +make check-all-fix # format + lint (ruff) +make test # run all test suites +make update-license-headers # add SPDX headers to new files +make perf-import CLEAN=1 # profile import time (run after adding heavy deps) ``` -This pattern provides: -- Runtime lazy loading (fast startup) -- Full IDE support (autocomplete, type hints) -- Type checker validation - -**See [lazy_heavy_imports.py](packages/data-designer-config/src/data_designer/lazy_heavy_imports.py) for the current list of lazy-loaded libraries.** - -#### Adding New Heavy Dependencies - -If you add a new dependency with significant import cost (>100ms): - -1. **Add to `lazy_heavy_imports.py`:** - ```python - _LAZY_IMPORTS = { - # ... existing entries ... 
- "your_lib": "your_library_name", - } - ``` - -2. **Update imports across codebase:** - ```python - from typing import TYPE_CHECKING - from data_designer.lazy_heavy_imports import your_lib - - if TYPE_CHECKING: - import your_library_name as your_lib # For IDE support - ``` - -3. **Verify with performance test:** - ```bash - make perf-import CLEAN=1 - ``` - -#### Using TYPE_CHECKING Blocks - -`TYPE_CHECKING` blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time. - -**For internal data_designer imports:** - -```python -from __future__ import annotations - -from typing import TYPE_CHECKING - -# Runtime imports -from pathlib import Path -from data_designer.config.base import ConfigBase - -if TYPE_CHECKING: - # Type-only imports - only visible to type checkers - from data_designer.engine.models.facade import ModelFacade - -def get_model(model: ModelFacade) -> str: - return model.name -``` - -**For lazy-loaded libraries (see pattern in "When to Use Lazy Loading" above):** -- Import from `lazy_heavy_imports` for runtime -- Add full import in `TYPE_CHECKING` block for IDE support - -**Rules for TYPE_CHECKING:** - -✅ **DO put in TYPE_CHECKING:** -- Internal `data_designer` imports used **only** in type hints -- Imports that would cause circular dependencies -- **Full imports of lazy-loaded libraries for IDE support** (e.g., `import pandas as pd` in addition to runtime `from data_designer.lazy_heavy_imports import pd`) - -❌ **DON'T put in TYPE_CHECKING:** -- **Standard library imports** (`Path`, `Any`, `Callable`, `Literal`, `TypeAlias`, etc.) -- **Pydantic model types** used in field definitions (needed at runtime for validation) -- **Types used in discriminated unions** (Pydantic needs them at runtime) -- **Any import used at runtime** (instantiation, method calls, base classes, etc.) 
- -**Examples:** - -```python -# ✅ CORRECT - Lazy-loaded library with IDE support -from typing import TYPE_CHECKING -from data_designer.lazy_heavy_imports import pd - -if TYPE_CHECKING: - import pandas as pd # IDE gets full type hints - -def load_data(path: str) -> pd.DataFrame: # IDE understands pd.DataFrame - return pd.read_csv(path) - -# ✅ CORRECT - Standard library NOT in TYPE_CHECKING -from pathlib import Path -from typing import Any - -def process_file(path: Path) -> Any: - return path.read_text() - -# ✅ CORRECT - Internal type-only import -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from data_designer.engine.models.facade import ModelFacade - -def get_model(model: ModelFacade) -> str: # Only used in type hint - return model.name - -# ❌ INCORRECT - Pydantic field type in TYPE_CHECKING -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from data_designer.config.models import ModelConfig # Wrong! - -class MyConfig(BaseModel): - model: ModelConfig # Pydantic needs this at runtime! - -# ✅ CORRECT - Pydantic field type at runtime -from data_designer.config.models import ModelConfig - -class MyConfig(BaseModel): - model: ModelConfig -``` - -### Naming Conventions (PEP 8) - -Follow PEP 8 naming conventions: - -- **Functions and variables**: `snake_case` -- **Classes**: `PascalCase` -- **Constants**: `UPPER_SNAKE_CASE` -- **Private attributes**: prefix with single underscore `_private_var` -- **Function and method names must start with an action verb**: e.g. 
`get_value_from` not `value_from`, `coerce_to_int` not `to_int`, `extract_usage` not `usage` - - ```python - # Good - class DatasetGenerator: - MAX_RETRIES = 3 - - def __init__(self) -> None: - self._cache: dict[str, str] = {} - - def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame: - pass - - # Bad - class dataset_generator: # Should be PascalCase - maxRetries = 3 # Should be UPPER_SNAKE_CASE - - def GenerateDataset(self, Config): # Should be snake_case - pass - ``` - -### Code Organization - -- **Public before private**: Public functions/methods appear before private ones in modules and classes -- **Class method order**: `__init__` and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all `@staticmethod`s in one block, all `@classmethod`s in one block). -- **Prefer public over private for testability**: Use public functions (no `_` prefix) for helpers that benefit from direct testing -- **Section comments in larger modules**: Use `# ---` separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors) - -### Design Principles - -**DRY** -- Extract shared logic into pure helper functions rather than duplicating across similar call sites -- Rule of thumb: tolerate duplication until the third occurrence, then extract - -**KISS** -- Prefer flat, obvious code over clever abstractions — two similar lines is better than a premature helper -- When in doubt between DRY and KISS, favor readability over deduplication - -**YAGNI** -- Don't add parameters, config, or abstraction layers for hypothetical future use cases -- Don't generalize until the third caller appears - -**SOLID** -- Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals -- Use `Protocol` for contracts between layers -- One function, one job — separate logic from I/O - -### Common Pitfalls to Avoid - -1. 
**Mutable default arguments**: - - ```python - # Bad - mutable default argument - def add_item(item: str, items: list[str] = []) -> list[str]: - items.append(item) - return items - - # Good - def add_item(item: str, items: list[str] | None = None) -> list[str]: - if items is None: - items = [] - items.append(item) - return items - ``` - -2. **Unused imports and variables**: - - ```python - # Bad - unused import - from pathlib import Path - from typing import Any # Not used - - def process() -> None: - pass - - # Good - only import what you use - from pathlib import Path - - def process() -> None: - pass - ``` - -3. **Simplify code where possible** (enforced by `SIM`): - - ```python - # Bad - if condition: - return True - else: - return False - - # Good - return condition - - # Bad - if key in my_dict: - value = my_dict[key] - else: - value = default - - # Good - value = my_dict.get(key, default) - ``` - -4. **Use comprehensions properly**: - - ```python - # Bad - list([x for x in items]) # Unnecessary list() call - - # Good - [x for x in items] - - # Bad - dict([(k, v) for k, v in items]) - - # Good - {k: v for k, v in items} - ``` - -5. 
**Proper return statements**: - - ```python - # Bad - unnecessary else after return - def get_value(condition: bool) -> str: - if condition: - return "yes" - else: - return "no" - - # Good - def get_value(condition: bool) -> str: - if condition: - return "yes" - return "no" - ``` - -### Active Linter Rules - -The following ruff linter rules are currently enabled (see [pyproject.toml](pyproject.toml)): - -- `W`: pycodestyle warnings -- `F`: pyflakes (unused imports, undefined names) -- `I`: isort (import sorting) -- `ICN`: flake8-import-conventions (standard import names) -- `PIE`: flake8-pie (miscellaneous lints) -- `TID`: flake8-tidy-imports (bans relative imports) -- `UP006`: `List[A]` -> `list[A]` -- `UP007`: `Union[A, B]` -> `A | B` -- `UP045`: `Optional[A]` -> `A | None` - -**Note**: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing. - -## Testing Patterns - -The project uses `pytest` with the following patterns: - -- **Fixtures**: Shared fixtures are provided via `pytest_plugins` from `data_designer.config.testing.fixtures` and `data_designer.engine.testing.fixtures`, plus local `conftest.py` files in each test directory -- **Stub configs**: YAML-based configuration stubs for testing (see `stub_data_designer_config_str` fixture) -- **Mocking**: Use `unittest.mock.patch` for external services and dependencies -- **Async support**: pytest-asyncio for async tests (`asyncio_default_fixture_loop_scope = "session"`) -- **HTTP mocking**: pytest-httpx for mocking HTTP requests -- **Coverage**: Track test coverage with pytest-cov - -### Test Guidelines - -- **Test public APIs only**: Tests should exercise public interfaces, not `_`-prefixed functions or classes. 
If something is hard to test without reaching into private internals, consider refactoring the code to expose a public entry point -- **Type annotations required**: Test functions and fixtures must include type annotations — `-> None` for tests, typed parameters, and typed return values for fixtures -- **Imports at module level**: Follow the same import rules as production code — keep imports at the top of the file, not inside test functions -- **Parametrize over duplicate**: Use `@pytest.mark.parametrize` (with `ids=` for readable names) instead of writing multiple test functions for variations of the same behavior -- **Minimal fixtures**: Fixtures should be simple — one fixture, one responsibility, just setup with no behavior logic -- **Shared fixtures in `conftest.py`**: Place fixtures shared across a test directory in `conftest.py` -- **Mock at boundaries**: Mock external dependencies (APIs, databases, third-party services), not internal functions -- **Test behavior, not implementation**: Assert on outputs and side effects, not internal call counts (unless verifying routing) -- **Keep mocking shallow**: If a test requires deeply nested mocking, the code under test may need refactoring - -Example test structure: - -```python -from typing import Any - -from data_designer.config.config_builder import DataDesignerConfigBuilder - - -def test_something(stub_model_configs: dict[str, Any]) -> None: - """Test description.""" - builder = DataDesignerConfigBuilder(model_configs=stub_model_configs) - # ... test implementation - assert expected == actual -``` - -## Column Configuration Types - -When working with column configurations, understand these key types: - -- **`SamplerColumnConfig`**: Built-in samplers (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.) 
-- **`LLMTextColumnConfig`**: LLM text generation with Jinja2 templating -- **`LLMCodeColumnConfig`**: Code generation with language specification -- **`LLMStructuredColumnConfig`**: Structured JSON generation with schema -- **`LLMJudgeColumnConfig`**: Judge/scoring columns for quality assessment -- **`ExpressionColumnConfig`**: Expression-based derived columns (Python eval or Jinja2) -- **`ValidationColumnConfig`**: Validation results (Python, SQL, Code, Remote validators) -- **`SeedDatasetColumnConfig`**: Data from seed datasets -- **`EmbeddingColumnConfig`**: Embedding generation for text columns using a specified model -- **`CustomColumnConfig`**: Custom user-defined column generators via `@custom_column_generator` decorator - -See [packages/data-designer-config/src/data_designer/config/column_configs.py](packages/data-designer-config/src/data_designer/config/column_configs.py) for detailed schemas. - -## Model Configuration - -Models are configured via `ModelConfig` with: - -- `alias`: User-defined alias for the model -- `model`: Model ID (e.g., from build.nvidia.com) -- `inference_parameters`: Temperature, top_p, max_tokens (can be distribution-based) -- `system_prompt`: Optional system prompt -- `image_modality`: Support for image inputs - -See [packages/data-designer-config/src/data_designer/config/models.py](packages/data-designer-config/src/data_designer/config/models.py) for details. - -## Registry System - -The project uses a registry pattern for extensibility. 
Key registries: - -- **Column generators**: [packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py](packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py) -- **Validators**: [packages/data-designer-engine/src/data_designer/engine/validators/](packages/data-designer-engine/src/data_designer/engine/validators/) -- **Column profilers**: [packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py](packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py) -- **Models**: [packages/data-designer-engine/src/data_designer/engine/models/registry.py](packages/data-designer-engine/src/data_designer/engine/models/registry.py) - -When adding new generators or validators, register them appropriately. - -## Pre-commit Hooks - -The project uses pre-commit hooks to enforce code quality. Install them with: - -```bash -uv run pre-commit install -``` - -Hooks include: -- Trailing whitespace removal -- End-of-file fixer -- YAML/JSON/TOML validation -- Merge conflict detection -- Debug statement detection -- Ruff linting and formatting - -## Common Development Tasks - -```bash -# Clean up generated files -make clean - -# Update license headers -make update-license-headers - -# Run all checks before committing -make check-all-fix -make test - -# Generate coverage report -make coverage -# View htmlcov/index.html in browser - -# Profile import performance (use after adding heavy dependencies) -make perf-import # Profile import time -make perf-import CLEAN=1 # Clean cache first, then profile -``` - -## Additional Resources - -- **README.md**: Installation and basic usage examples -- **packages/data-designer-config/src/data_designer/config/**: Configuration API documentation -- **tests/**: Comprehensive test suite with usage examples +For full setup, testing, and workflow details see [DEVELOPMENT.md](DEVELOPMENT.md). 
+For code style, naming, and import conventions see [STYLEGUIDE.md](STYLEGUIDE.md). diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 15115b367..22515d78b 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,236 +1,103 @@ # 🎨✨ Contributing to NeMo Data Designer 🎨✨ -Thank you for your interest in contributing to Data Designer! +The skills and workflows in this repository are for **developing** DataDesigner. If you're looking to **use** DataDesigner to build datasets, see the [product documentation](https://nvidia-nemo.github.io/DataDesigner/) instead. -We welcome contributions from the community and sincerely appreciate your efforts to improve the project. Whether you're fixing a typo, reporting a bug, proposing a new feature, or implementing a major enhancement, your work helps make Data Designer better for everyone 🎉. - -This guide will help you get started with the contribution process. - -## Table of Contents - -- [Getting Started](#getting-started) -- [Ways to Contribute](#ways-to-contribute) -- [Feature Requests](#feature-requests) -- [Development Guide](#development-guide) -- [Submitting Changes](#submitting-changes) -- [Code of Conduct](#code-of-conduct) -- [Signing off on your work](#signing-off-on-your-work) - - -## Getting Started -👋 Welcome to the Data Designer community! We're excited to have you here. - -Whether you're new to the project or ready to dive in, the resources below will help you get oriented and productive quickly: - -1. **[README.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/README.md)** – best place to start to learn the basics of the project - -2. **[AGENTS.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/AGENTS.md)** – context and instructions to help AI coding agents work on Data Designer (it's also useful for human developers!) - -3. 
**[Documentation](https://nvidia-nemo.github.io/DataDesigner/)** – detailed documentation on Data Designer's capabilities and usage - -## Ways to Contribute - -There are many ways to contribute to Data Designer: - -### 🐛 Bug Fixes - -Found a bug? Before reporting, please -1. Verify you're using the latest version: `uv pip install --upgrade data-designer` -2. Search for duplicates in the [issue tracker](https://github.com/NVIDIA-NeMo/DataDesigner/issues) - -When [creating a bug report](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new), please include: -- Data Designer version -- Python version and operating system -- Minimal reproducible example -- Expected vs. actual behavior -- Full error messages and stack traces - -If you are interested in fixing the bug yourself, that's AWESOME! Please follow the [development guide](#development-guide) to get started. - -### ✨ Feature Implementation -Want to add new functionality? Great! Please review [our development approach](#feature-requests) and open a feature request to discuss the idea and get feedback before investing significant time on the implementation. - -### 📖 Documentation Improvements -Documentation is crucial for user adoption. Contributions that clarify usage, add examples, or fix typos are highly valued. - -### 💡 Examples and Tutorials -Share your use cases! Example notebooks and tutorials help others understand how to leverage Data Designer effectively. - -### 🧪 Test Coverage -Help us improve test coverage by adding tests for untested code paths or edge cases. - -## Feature Requests -Data Designer is designed to be as flexible and extensible as possible, and we welcome your ideas for pushing its capabilities even further! 
To keep the core library maintainable, while also supporting innovation, we take an incremental approach when adding new features – we explore what's already possible, extend through plugins when needed, and integrate the most broadly useful features into the core library: - -### How We Grow Data Designer -1. 🧗 **Explore what's possible**: Can your use case be achieved with current features? We've designed Data Designer to be composable – sometimes creative combinations of existing tools can accomplish what you need. Check out our examples or open an issue if you'd like help exploring this! - -2. 🔌 **Extend through plugins**: If existing features aren't quite enough, consider implementing your idea as a plugin that extends the core library. Plugins let you experiment and share functionality while keeping the core library focused. - -3. ⚙️ **Integrate into the core library**: If your feature or plugin proves broadly useful and aligns with Data Designer's goals, we'd love to integrate it into the core library! We're happy to discuss whether it's a good fit and how to move forward together. - -This approach helps us grow thoughtfully while keeping Data Designer focused and maintainable. - -### Submitting a Feature Request -Open a [new issue](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new) with: - -- **Clear title**: Concise description of the feature -- **Use case**: Explain what problem this solves and why it's important -- **Proposed solution**: Describe how you envision the feature working -- **Alternatives considered**: Other approaches you've thought about -- **Examples**: Code examples or mockups of how users would interact with the feature -- **Willingness to implement**: Are you interested in implementing this yourself? - -## Development Guide -Data Designer uses [`uv`](https://github.com/astral-sh/uv) for dependency management. 
If you don't have uv installed, follow their [installation instructions](https://docs.astral.sh/uv/getting-started/installation/). - -### Initial Setup -0. **Create or find an issue** - - Before starting work, ensure there's an issue tracking your contribution: - - - For bug fixes: Search [existing issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues) or [create a new one](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new) - - For new features: Open a [feature request](#feature-requests) to discuss the approach first - - Comment on the issue to let maintainers know you're working on it - -1. **Fork and clone the repository** - - Start by [forking the Data Designer repository](https://github.com/NVIDIA-NeMo/DataDesigner/fork), then clone your fork and add the upstream remote: - - ```bash - git clone https://github.com/YOUR_GITHUB_USERNAME/DataDesigner.git - - cd DataDesigner - - git remote add upstream https://github.com/NVIDIA-NeMo/DataDesigner.git - ``` - -2. **Install dependencies** - - ```bash - # Install project with dev dependencies - make install-dev - - # Or, if you use Jupyter / IPython for development - make install-dev-notebooks - ``` - -3. **Verify your setup** - - ```bash - make test && make check-all - ``` - - If no errors are reported, you're ready to develop 🚀 - -### Making Changes - -1. **Create a feature branch** - - ```bash - git checkout main - git pull upstream main - git checkout -b <username>/<type>/<issue-number>-<short-description> - ``` - - Example types of change: - - - `feat` for new features - - `fix` for bug fixes - - `docs` for documentation updates - - `test` for testing changes - - `refactor` for code refactoring - - `chore` for chore tasks - - `style` for style changes - - `perf` for performance improvements - - Example branch name: - - - `johnnygreco/feat/123-add-xyz-generator` for a new feature by @johnnygreco, addressing issue #123 - -2.
**Develop your changes** - - Please follow the patterns and conventions used throughout the codebase, as well as those outlined in [AGENTS.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/AGENTS.md). - -3. **Test and validate** +--- - ```bash - make check-all-fix # Format code and fix linting issues - make test # Run all tests - make coverage # Check test coverage (must be >90%) - ``` +This project uses agent-assisted development. Contributors are expected to use agents for investigation, planning, and implementation. The repository includes [skills and guidance](.agents/) that make agents effective contributors. - **Writing tests**: Place tests in [tests/](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/tests/) mirroring the source structure. Use fixtures from [tests/conftest.py](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/tests/conftest.py), mock external services with `unittest.mock` or `pytest-httpx`, and test both success and failure cases. See [AGENTS.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/AGENTS.md) for patterns and examples. +Agents accelerate work; humans stay accountable. People make design decisions and own quality — agents help get there faster. -4. **Commit your work** +## How to Contribute - Write clear, descriptive commit messages, optionally including a brief summary (50 characters or less) and reference issue numbers when applicable (e.g., "Fixes #123"). +1. **Open an issue** using the appropriate [issue template](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new/choose). +2. **Include investigation output.** If you used an agent, paste its diagnostics. If you didn't, include the troubleshooting you tried. +3. **For non-trivial changes, submit a plan** in the issue for review before building. Plans should describe the approach, trade-offs considered, and affected subsystems. +4. **Once approved, implement** using agent-assisted development. 
See [DEVELOPMENT.md](DEVELOPMENT.md) for local setup and workflow. - ```bash - git commit -m "Add XYZ generator for synthetic data" -m "Fixes #123" - ``` +## Before You Open an Issue -5. **Stay up to date** +- Clone the repo and point your agent at it +- Have the agent search docs and existing issues (the `search-docs` and `search-github` skills can help) +- If the agent can't resolve it, include the diagnostics in your issue +- If you didn't use an agent, include the troubleshooting you already tried - Regularly sync your branch with upstream changes: +## When to Open an Issue - ```bash - git fetch upstream - git merge upstream/main - ``` +- Real bugs — reproduced or agent-confirmed +- Feature proposals with design context +- Problems that `search-docs` / `search-github` couldn't resolve -## Submitting Changes +## When NOT to Open an Issue -### Before Submitting +- Questions about how things work — an agent can answer these from the codebase +- Configuration problems — an agent can diagnose these +- "How do I..." requests — try the [product documentation](https://nvidia-nemo.github.io/DataDesigner/) first -Ensure your changes meet the following criteria: +--- -- All tests pass (`make test`) -- Code is formatted and linted (`make check-all-fix`) -- New functionality includes tests -- Documentation is updated (README, docstrings, examples) -- License headers are present on all new files -- Commit messages are clear and descriptive +## Development Skills -### Creating a Pull Request +The repository includes skills for common development tasks. These are located in [`.agents/skills/`](.agents/skills/) and are automatically discovered by agent harnesses. -1. 
**Push your changes** to your fork: +| Category | Skills | Purpose | +| ------------- | ---------------------------------- | -------------------------------------- | +| Investigation | `search-docs`, `search-github` | Find information, check for duplicates | +| Development | `commit`, `create-pr`, `update-pr` | Standard development cycle | +| Review | `review-code` | Multi-pass code review | - ```bash - git push origin //- - ``` +--- -2. **Open a pull request** on GitHub from your fork to the main repository +## Pull Requests -3. **Respond to review feedback** update your PR as needed +- PRs should link to the issue they address (`Fixes #NNN` or `Closes #NNN`) +- Use the `create-pr` skill for well-formatted PR descriptions, or follow the PR template +- Ensure all checks pass before requesting review: + ```bash + make check-all-fix + make test + ``` ### Pull Request Review Process - Maintainers will review your PR and may request changes - Address feedback by pushing additional commits to your branch -- Reply to the feedback comment with a link to the commit that addresses it. +- Reply to the feedback comment with a link to the commit that addresses it - Once approved, a maintainer will merge your PR -- Your contribution will be included in the next release! -## Code of Conduct -Data Designer follows the Contributor Covenant Code of Conduct. We are committed to providing a welcoming and inclusive environment for all contributors. +--- -**Please read our complete [Code of Conduct](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/CODE_OF_CONDUCT.md)** for full details on our standards and expectations. 
+## Commit Messages -### License File Headers -All code files that are added to this repository must include the appropriate NVIDIA copyright header: +- Use imperative mood ("add feature" not "added feature") +- Keep the subject line under 50 characters (hard limit: 72) +- Reference issue numbers when applicable (`Fixes #123`) + +--- + +## License Headers + +All code files must include the NVIDIA copyright header: ```python -# SPDX-FileCopyrightText: Copyright (c) {YEAR} NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 ``` Use `make update-license-headers` to add headers automatically. -## Signing off on your work +## Signing Off on Your Work (DCO) -When contributing to this project, you must agree that you have authored 100% of the content, that you have the necessary rights to the content and that the content you contribute may be provided under the project license. All contributors are asked to sign the Data Designer [Developer Certificate of Origin (DCO)](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/DCO) when submitting their first pull request. The process is automated by a bot that will comment on the pull request. Our DCO is the same as the Linux Foundation requires its contributors to sign. +When contributing, you must agree that you have authored 100% of the content, that you have the necessary rights to the content, and that the content you contribute may be provided under the project license. All contributors are asked to sign the [Developer Certificate of Origin (DCO)](DCO) when submitting their first pull request. The process is automated by a bot that will comment on the pull request. + +## Code of Conduct + +Data Designer follows the Contributor Covenant Code of Conduct. Please read our complete [Code of Conduct](CODE_OF_CONDUCT.md) for full details. 
--- -Thank you for contributing to NeMo Data Designer! Your efforts help make synthetic data generation more accessible and powerful for everyone. 🎨✨ +## Reference + +- [AGENTS.md](AGENTS.md) — architecture, layering, design principles +- [STYLEGUIDE.md](STYLEGUIDE.md) — code style, naming, imports, type annotations +- [DEVELOPMENT.md](DEVELOPMENT.md) — local setup, testing, day-to-day workflow diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md new file mode 100644 index 000000000..86ea62031 --- /dev/null +++ b/DEVELOPMENT.md @@ -0,0 +1,195 @@ +# Development Guide + +This document covers local setup, day-to-day development workflow, testing, and pre-commit usage for DataDesigner contributors. + +For architectural invariants and project identity, see [AGENTS.md](AGENTS.md). +For code style, naming, and import conventions, see [STYLEGUIDE.md](STYLEGUIDE.md). +For the contribution workflow (issues, PRs, agent-assisted development), see [CONTRIBUTING.md](CONTRIBUTING.md). + +--- + +## Prerequisites + +- **Python 3.10+** +- **[uv](https://docs.astral.sh/uv/getting-started/installation/)** for dependency management +- **[GNU Make](https://www.gnu.org/software/make/)** for common tasks + +## Local Setup + +### Clone and Install + +```bash +git clone https://github.com/NVIDIA-NeMo/DataDesigner.git +cd DataDesigner + +# Install with dev dependencies +make install-dev + +# Or, if you use Jupyter / IPython for development +make install-dev-notebooks +``` + +### Verify Your Setup + +```bash +make test && make check-all +``` + +If no errors are reported, you're ready to develop. + +--- + +## Day-to-Day Workflow + +### Branching + +```bash +git checkout main +git pull origin main +git checkout -b <username>/<type>/<issue-number>-<short-description> +``` + +Branch name types: `feat`, `fix`, `docs`, `test`, `refactor`, `chore`, `style`, `perf`.
+ +Example: `nmulepati/feat/123-add-xyz-generator` + +### Syncing with Upstream + +```bash +git fetch origin +git merge origin/main +``` + +### Validation Before Committing + +```bash +make check-all-fix # format + lint (ruff) +make test # run all test suites +``` + +--- + +## Code Quality + +### Using Make (Recommended) + +```bash +make lint # Run ruff linter +make lint-fix # Fix linting issues automatically +make format # Format code with ruff +make format-check # Check code formatting without changes +make check-all # Run all checks (format-check + lint) +make check-all-fix # Run all checks with autofix (format + lint-fix) +``` + +### Direct Commands + +```bash +uv run ruff check # Lint all files +uv run ruff check --fix # Lint with autofix +uv run ruff format # Format all files +uv run ruff format --check # Check formatting +``` + +--- + +## Testing + +### Running Tests + +```bash +# Run all tests +make test + +# Run tests with verbose output +uv run pytest -v + +# Run a specific test file +uv run pytest tests/config/test_sampler_constraints.py + +# Run tests with coverage +make coverage +# View htmlcov/index.html in browser +``` + +### Test Patterns + +The project uses `pytest` with the following patterns: + +- **Fixtures**: Shared fixtures are provided via `pytest_plugins` from `data_designer.config.testing.fixtures` and `data_designer.engine.testing.fixtures`, plus local `conftest.py` files in each test directory +- **Stub configs**: YAML-based configuration stubs for testing (see `stub_data_designer_config_str` fixture) +- **Mocking**: Use `unittest.mock.patch` for external services and dependencies +- **Async support**: pytest-asyncio for async tests (`asyncio_default_fixture_loop_scope = "session"`) +- **HTTP mocking**: pytest-httpx for mocking HTTP requests +- **Coverage**: Track test coverage with pytest-cov + +### Test Guidelines + +- **Test public APIs only**: Tests should exercise public interfaces, not `_`-prefixed functions or classes. 
If something is hard to test without reaching into private internals, consider refactoring the code to expose a public entry point +- **Type annotations required**: Test functions and fixtures must include type annotations — `-> None` for tests, typed parameters, and typed return values for fixtures +- **Imports at module level**: Follow the same import rules as production code — keep imports at the top of the file, not inside test functions +- **Parametrize over duplicate**: Use `@pytest.mark.parametrize` (with `ids=` for readable names) instead of writing multiple test functions for variations of the same behavior +- **Minimal fixtures**: Fixtures should be simple — one fixture, one responsibility, just setup with no behavior logic +- **Shared fixtures in `conftest.py`**: Place fixtures shared across a test directory in `conftest.py` +- **Mock at boundaries**: Mock external dependencies (APIs, databases, third-party services), not internal functions +- **Test behavior, not implementation**: Assert on outputs and side effects, not internal call counts (unless verifying routing) +- **Keep mocking shallow**: If a test requires deeply nested mocking, the code under test may need refactoring + +### Example Test + +```python +from typing import Any + +from data_designer.config.config_builder import DataDesignerConfigBuilder + + +def test_something(stub_model_configs: dict[str, Any]) -> None: + """Test description.""" + builder = DataDesignerConfigBuilder(model_configs=stub_model_configs) + # ... test implementation + assert expected == actual +``` + +--- + +## Pre-commit Hooks + +The project uses pre-commit hooks to enforce code quality. 
Install them with: + +```bash +uv run pre-commit install +``` + +Hooks include: +- Trailing whitespace removal +- End-of-file fixer +- YAML/JSON/TOML validation +- Merge conflict detection +- Debug statement detection +- Ruff linting and formatting + +--- + +## Common Tasks + +```bash +make clean # Clean up generated files +make update-license-headers # Add SPDX headers to new files +make check-all-fix # Format + lint before committing +make test # Run all tests +make coverage # Run tests with coverage report +make perf-import # Profile import time +make perf-import CLEAN=1 # Clean cache first, then profile +``` + +--- + +## Import Performance + +After adding heavy third-party dependencies, verify import performance: + +```bash +make perf-import CLEAN=1 +``` + +If import time regresses, add the dependency to `lazy_heavy_imports.py` — see [STYLEGUIDE.md](STYLEGUIDE.md) for the lazy loading pattern. diff --git a/README.md b/README.md index 4555386da..55e6e6dd8 100644 --- a/README.md +++ b/README.md @@ -112,7 +112,9 @@ data-designer config list # View current settings ### 🤝 Get involved -- **[Contributing Guide](https://nvidia-nemo.github.io/DataDesigner/latest/CONTRIBUTING)** – Help improve Data Designer +This repository supports agent-assisted development — see [CONTRIBUTING.md](CONTRIBUTING.md) for the recommended workflow. + +- **[Contributing Guide](CONTRIBUTING.md)** – How to contribute, including agent-assisted workflows - **[GitHub Issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues)** – Report bugs or make a feature request --- diff --git a/STYLEGUIDE.md b/STYLEGUIDE.md new file mode 100644 index 000000000..dae2807cb --- /dev/null +++ b/STYLEGUIDE.md @@ -0,0 +1,361 @@ +# Style Guide + +This document is the authoritative reference for code style, naming, type annotations, import patterns, and design principles in DataDesigner. It is extracted from the project's coding standards and enforced by `ruff` (>=0.14.10). 
+ +For architectural invariants and project identity, see [AGENTS.md](AGENTS.md). +For development workflow and testing, see [DEVELOPMENT.md](DEVELOPMENT.md). + +--- + +## General Formatting + +- **Line length**: Maximum 120 characters per line +- **Quote style**: Always use double quotes (`"`) for strings +- **Indentation**: Use 4 spaces (never tabs) +- **Target version**: Python 3.10+ + +## License Headers + +All Python files must include the NVIDIA SPDX license header: + +```python +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +``` + +Use `make update-license-headers` to add headers to all files automatically. + +## Future Annotations + +Include `from __future__ import annotations` at the top of every Python source file (after the license header) for deferred type evaluation. + +## Comments + +Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments. + +--- + +## Type Annotations + +Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules `UP006`, `UP007`, and `UP045`. 
+ +- **ALWAYS** add type annotations to all functions, methods, and class attributes (including tests) +- Use primitive types when possible: `list` not `List`, `dict` not `Dict`, `set` not `Set`, `tuple` not `Tuple` (enforced by `UP006`) +- Use modern union syntax with `|` for optional and union types: + - `str | None` not `Optional[str]` (enforced by `UP045`) + - `int | str` not `Union[int, str]` (enforced by `UP007`) +- Only import from `typing` when absolutely necessary for complex generic types +- For Pydantic models, use field-level type annotations + +```python +# Good +def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]: + return {item: len(item) for item in items} + +# Avoid - missing type annotations +def process_items(items, max_count=None): + return {item: len(item) for item in items} + +# Avoid - old-style typing +from typing import List, Dict, Optional +def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]: + return {item: len(item) for item in items} +``` + +--- + +## Import Style + +- **ALWAYS** use absolute imports, never relative imports (enforced by `TID`) +- Place imports at module level, not inside functions (exception: unavoidable for performance reasons) +- Import sorting is handled by `ruff`'s `isort` — imports should be grouped and sorted: + 1. Standard library imports + 2. Third-party imports (use `lazy_heavy_imports` for heavy libraries) + 3. 
First-party imports (`data_designer`) +- Use standard import conventions (enforced by `ICN`) + +```python +# Good +from data_designer.config.config_builder import DataDesignerConfigBuilder + +# Bad - relative import (will cause linter errors) +from .config_builder import DataDesignerConfigBuilder + +# Good - imports at module level +from pathlib import Path + +def process_file(filename: str) -> None: + path = Path(filename) + +# Bad - import inside function +def process_file(filename: str) -> None: + from pathlib import Path + path = Path(filename) +``` + +### Lazy Loading and TYPE_CHECKING + +This project uses lazy loading for heavy third-party dependencies to optimize import performance. + +**Heavy third-party libraries** (>100ms import cost) should be lazy-loaded via `lazy_heavy_imports.py`: + +```python +# Don't import directly +import pandas as pd +import numpy as np + +# Use lazy loading with IDE support +from typing import TYPE_CHECKING +from data_designer.lazy_heavy_imports import pd, np + +if TYPE_CHECKING: + import pandas as pd + import numpy as np +``` + +This pattern provides: +- Runtime lazy loading (fast startup) +- Full IDE support (autocomplete, type hints) +- Type checker validation + +See [lazy_heavy_imports.py](packages/data-designer-config/src/data_designer/lazy_heavy_imports.py) for the current list of lazy-loaded libraries. + +#### Adding New Heavy Dependencies + +If you add a new dependency with significant import cost (>100ms): + +1. **Add to `lazy_heavy_imports.py`:** + ```python + _LAZY_IMPORTS = { + # ... existing entries ... + "your_lib": "your_library_name", + } + ``` + +2. **Update imports across codebase:** + ```python + from typing import TYPE_CHECKING + from data_designer.lazy_heavy_imports import your_lib + + if TYPE_CHECKING: + import your_library_name as your_lib + ``` + +3. 
**Verify with performance test:** + ```bash + make perf-import CLEAN=1 + ``` + +#### TYPE_CHECKING Rules + +`TYPE_CHECKING` blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time. + +**DO put in TYPE_CHECKING:** +- Internal `data_designer` imports used **only** in type hints +- Imports that would cause circular dependencies +- Full imports of lazy-loaded libraries for IDE support (e.g., `import pandas as pd` in addition to runtime `from data_designer.lazy_heavy_imports import pd`) + +**DON'T put in TYPE_CHECKING:** +- Standard library imports (`Path`, `Any`, `Callable`, `Literal`, `TypeAlias`, etc.) +- Pydantic model types used in field definitions (needed at runtime for validation) +- Types used in discriminated unions (Pydantic needs them at runtime) +- Any import used at runtime (instantiation, method calls, base classes, etc.) + +**Examples:** + +```python +# CORRECT - Lazy-loaded library with IDE support +from typing import TYPE_CHECKING +from data_designer.lazy_heavy_imports import pd + +if TYPE_CHECKING: + import pandas as pd + +def load_data(path: str) -> pd.DataFrame: + return pd.read_csv(path) + +# CORRECT - Standard library NOT in TYPE_CHECKING +from pathlib import Path +from typing import Any + +def process_file(path: Path) -> Any: + return path.read_text() + +# CORRECT - Internal type-only import +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + from data_designer.engine.models.facade import ModelFacade + +def get_model(model: ModelFacade) -> str: + return model.name + +# INCORRECT - Pydantic field type in TYPE_CHECKING +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + from data_designer.config.models import ModelConfig # Wrong! + +class MyConfig(BaseModel): + model: ModelConfig # Pydantic needs this at runtime! 
+ +# CORRECT - Pydantic field type at runtime +from data_designer.config.models import ModelConfig + +class MyConfig(BaseModel): + model: ModelConfig +``` + +--- + +## Naming Conventions (PEP 8) + +- **Functions and variables**: `snake_case` +- **Classes**: `PascalCase` +- **Constants**: `UPPER_SNAKE_CASE` +- **Private attributes**: prefix with single underscore `_private_var` +- **Function and method names must start with an action verb**: e.g. `get_value_from` not `value_from`, `coerce_to_int` not `to_int`, `extract_usage` not `usage` + +```python +# Good +class DatasetGenerator: + MAX_RETRIES = 3 + + def __init__(self) -> None: + self._cache: dict[str, str] = {} + + def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame: + pass + +# Bad +class dataset_generator: # Should be PascalCase + maxRetries = 3 # Should be UPPER_SNAKE_CASE + + def GenerateDataset(self, Config): # Should be snake_case + pass +``` + +--- + +## Code Organization + +- **Public before private**: Public functions/methods appear before private ones in modules and classes +- **Class method order**: `__init__` and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all `@staticmethod`s in one block, all `@classmethod`s in one block). +- **Prefer public over private for testability**: Use public functions (no `_` prefix) for helpers that benefit from direct testing +- **Section comments in larger modules**: Use `# ---` separators to delineate logical groups (e.g. 
image parsing, usage extraction, generic accessors)
+
+---
+
+## Design Principles
+
+**DRY**
+- Extract shared logic into pure helper functions rather than duplicating across similar call sites
+- Rule of thumb: tolerate duplication until the third occurrence, then extract
+
+**KISS**
+- Prefer flat, obvious code over clever abstractions — two similar lines are better than a premature helper
+- When in doubt between DRY and KISS, favor readability over deduplication
+
+**YAGNI**
+- Don't add parameters, config, or abstraction layers for hypothetical future use cases
+- Don't generalize until the third caller appears
+
+**SOLID**
+- Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
+- Use `Protocol` for contracts between layers
+- One function, one job — separate logic from I/O
+
+---
+
+## Common Pitfalls to Avoid
+
+1. **Mutable default arguments**:
+
+   ```python
+   # Bad
+   def add_item(item: str, items: list[str] = []) -> list[str]:
+       items.append(item)
+       return items
+
+   # Good
+   def add_item(item: str, items: list[str] | None = None) -> list[str]:
+       if items is None:
+           items = []
+       items.append(item)
+       return items
+   ```
+
+2. **Unused imports and variables**:
+
+   ```python
+   # Bad
+   from pathlib import Path
+   from typing import Any  # Not used
+
+   def process() -> None:
+       pass
+
+   # Good
+   from pathlib import Path
+
+   def process() -> None:
+       pass
+   ```
+
+3. **Simplify code where possible** (ruff rule `SIM`, not yet enforced; see Active Linter Rules):
+
+   ```python
+   # Bad
+   if condition:
+       return True
+   else:
+       return False
+
+   # Good
+   return condition
+   ```
+
+4. **Use comprehensions properly**:
+
+   ```python
+   # Bad
+   list([x for x in items])  # Unnecessary list() call
+
+   # Good
+   [x for x in items]
+   ```
+
+5.
**Proper return statements**: + + ```python + # Bad - unnecessary else after return + def get_value(condition: bool) -> str: + if condition: + return "yes" + else: + return "no" + + # Good + def get_value(condition: bool) -> str: + if condition: + return "yes" + return "no" + ``` + +--- + +## Active Linter Rules + +The following ruff linter rules are currently enabled (see [pyproject.toml](pyproject.toml)): + +- `W`: pycodestyle warnings +- `F`: pyflakes (unused imports, undefined names) +- `I`: isort (import sorting) +- `ICN`: flake8-import-conventions (standard import names) +- `PIE`: flake8-pie (miscellaneous lints) +- `TID`: flake8-tidy-imports (bans relative imports) +- `UP006`: `List[A]` -> `list[A]` +- `UP007`: `Union[A, B]` -> `A | B` +- `UP045`: `Optional[A]` -> `A | None` + +**Note**: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing. diff --git a/architecture/agent-introspection.md b/architecture/agent-introspection.md new file mode 100644 index 000000000..5f714ca3d --- /dev/null +++ b/architecture/agent-introspection.md @@ -0,0 +1,21 @@ +# Agent Introspection + +> Stub — to be populated. See source code at `packages/data-designer/src/data_designer/cli/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [CLI](cli.md) +- [Config Layer](config.md) diff --git a/architecture/cli.md b/architecture/cli.md new file mode 100644 index 000000000..12b4d99b5 --- /dev/null +++ b/architecture/cli.md @@ -0,0 +1,20 @@ +# CLI + +> Stub — to be populated. See source code at `packages/data-designer/src/data_designer/cli/`. 
+ +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Agent Introspection](agent-introspection.md) diff --git a/architecture/config.md b/architecture/config.md new file mode 100644 index 000000000..894211d67 --- /dev/null +++ b/architecture/config.md @@ -0,0 +1,21 @@ +# Config Layer + +> Stub — to be populated. See source code at `packages/data-designer-config/src/data_designer/config/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Engine Layer](engine.md) +- [Plugins](plugins.md) diff --git a/architecture/dataset-builders.md b/architecture/dataset-builders.md new file mode 100644 index 000000000..bf055c7af --- /dev/null +++ b/architecture/dataset-builders.md @@ -0,0 +1,21 @@ +# Dataset Builders + +> Stub — to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/dataset_builders/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Engine Layer](engine.md) +- [Config Layer](config.md) diff --git a/architecture/engine.md b/architecture/engine.md new file mode 100644 index 000000000..110ca18fb --- /dev/null +++ b/architecture/engine.md @@ -0,0 +1,22 @@ +# Engine Layer + +> Stub — to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Config Layer](config.md) +- [Dataset Builders](dataset-builders.md) +- [Models](models.md) diff --git a/architecture/mcp.md b/architecture/mcp.md new file mode 100644 index 000000000..a7884d452 --- /dev/null +++ b/architecture/mcp.md @@ -0,0 +1,21 @@ +# MCP + +> Stub — to be populated. 
See source code at `packages/data-designer-engine/src/data_designer/engine/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Engine Layer](engine.md) +- [Models](models.md) diff --git a/architecture/models.md b/architecture/models.md new file mode 100644 index 000000000..46e26e242 --- /dev/null +++ b/architecture/models.md @@ -0,0 +1,20 @@ +# Models + +> Stub — to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/models/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Engine Layer](engine.md) diff --git a/architecture/overview.md b/architecture/overview.md new file mode 100644 index 000000000..c829d8f6c --- /dev/null +++ b/architecture/overview.md @@ -0,0 +1,22 @@ +# System Architecture + +> Stub — to be populated. See the three packages under `packages/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [Config Layer](config.md) +- [Engine Layer](engine.md) +- [Models](models.md) +- [Dataset Builders](dataset-builders.md) diff --git a/architecture/plugins.md b/architecture/plugins.md new file mode 100644 index 000000000..09f3ce143 --- /dev/null +++ b/architecture/plugins.md @@ -0,0 +1,21 @@ +# Plugins + +> Stub — to be populated. See source code at `packages/data-designer-config/src/data_designer/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Config Layer](config.md) +- [Engine Layer](engine.md) diff --git a/architecture/sampling.md b/architecture/sampling.md new file mode 100644 index 000000000..97d811b67 --- /dev/null +++ b/architecture/sampling.md @@ -0,0 +1,21 @@ +# Sampling + +> Stub — to be populated. 
See source code at `packages/data-designer-engine/src/data_designer/engine/sampling_gen/`. + +## Overview + + +## Key Components + + +## Data Flow + + +## Design Decisions + + +## Cross-References + +- [System Architecture](overview.md) +- [Engine Layer](engine.md) +- [Config Layer](config.md) From 5d777177037939667849ef1fd1957725bc4dcad2 Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Mon, 23 Mar 2026 17:37:05 -0600 Subject: [PATCH 2/7] remove obsolete new-sdg skill The new-sdg skill is superseded by skills/data-designer/, which is the proper usage skill for building datasets. Update .agents/README.md to reference the usage skill's actual location. Made-with: Cursor --- .agents/README.md | 4 +- .agents/skills/new-sdg/SKILL.md | 120 ---------------------- plans/427/agent-first-development-plan.md | 49 ++++----- 3 files changed, 29 insertions(+), 144 deletions(-) delete mode 100644 .agents/skills/new-sdg/SKILL.md diff --git a/.agents/README.md b/.agents/README.md index 4efd1e332..577909377 100644 --- a/.agents/README.md +++ b/.agents/README.md @@ -20,4 +20,6 @@ Tool-specific directories symlink back here so each harness resolves skills from ## Scope -All skills and agents in this directory are for **contributors developing DataDesigner** — not for end users building datasets. If you're looking for usage tooling, see the [product documentation](https://nvidia-nemo.github.io/DataDesigner/). +All skills and agents in this directory are for **contributors developing DataDesigner** — not for end users building datasets. + +The usage skill for building datasets with DataDesigner lives separately at [`skills/data-designer/`](../skills/data-designer/). For product documentation, see the [docs site](https://nvidia-nemo.github.io/DataDesigner/). 
diff --git a/.agents/skills/new-sdg/SKILL.md b/.agents/skills/new-sdg/SKILL.md deleted file mode 100644 index db9081931..000000000 --- a/.agents/skills/new-sdg/SKILL.md +++ /dev/null @@ -1,120 +0,0 @@ ---- -name: new-sdg -description: Implement a new synthetic data generator using NeMo Data Designer by defining its configuration and executing a preview job. -argument-hint: -disable-model-invocation: true -metadata: - internal: true ---- - -# Your Goal - -Implement a new synthetic data generator using NeMo Data Designer to match the user's specifications below. - - - **$ARGUMENTS** - - -## Getting Exact Specifications - -The user will provide you with some description, but it is likely that you -do not have enough information to precisely define what they want. It is hard -for a user to define everything up front. Ask follow up questions to the user -using the AskUser tool to narrow down on precisely what they want. - -Common things to make precise are: - -- IMPORTANT: What the "axes of diversity" are -- e.g. what should be well represented and diverse in the resulting dataset. -- The kind an nature of any input data to the dataset. -- What variables should be randomized. -- The schema of the final dataset. -- The structure of any required structured output columns. -- What facets of the output dataset are important to capture. - -## Interactive, Iterative Design - -> USER: Request -> YOU: Clarifying AskUser Questions -> YOU: Script Impelmentation (with preview) -> YOU: Script Execution -> YOU: Result Presentation -> YOU: Followup Questions -> USER: Respond -> YOU: ...repeat... - -Very often, the initial implementation will not conform precisely to what the user wants. You are to engage in an **iterative design loop** with the user. As shown -in the example below, you will construct a configuration, then review its outputs, -present those outputs to the user, and ask follow up questions. 
- -Depending on the user responses, you will then edit the script, re-run it, and present the user with the results and ask followups and so. When showing results to the user DO NOT SUMMARIZE content, it is *very important* that you show them the records as-is so they can make thoughtful decisions. - -DO NOT disengage from this **iterative design loop** unless commanded by the user. - - -## Implementing a NeMo Data Designer Synthetic Data Generator - -- You will be writing a new python script for execution. -- The script should be made in the current working directory, so `$(pwd)/script-name.py`. -- Implement the script as a stand-alone, `uv`-executable script (https://docs.astral.sh/uv/guides/scripts/#creating-a-python-script). -- The script should depend on the latest version of `data-designer`. -- Include other third-party dependencies only if the job requires it. -- Model aliases are required when definining LLM generation columns. -- Before implementing, make sure to use the Explore tool to understand the src/ and docs/. -- Review available model aliases and providers. -- You will need to ask the user what Model Provider they want to use via AskUser tool. -- You may use Web Search to find any information you need to help you construct the SDG, since real-world grounding is key to a good dataset. -- If you need to use a large number of categories for a sampler, just build a pandas DataFrame and use it as a Seed dataset. - -### Model Alises and Providers - -View known model aliases and providers with the following command. You will need a longer timeout on first run (package first-time boot). - -```bash -uv run --with data-designer data-designer config list -``` - -### Real World Seed Data - -Depending on user requirements, you may need to access real-world datasets to serve as Seed datasets for your Data Designer SDG. 
-In these cases, you may use Web Search tools to search for datasets available on HuggingFace, and use the `datasets` python library -to load them. You will have to convert them to Pandas DataFrames in these cases. - -If you do use real-world data, pay attention to file sizes and avoid large file transfers. Only download small sections of datasets or use a streaming option. - -### Example - -```python -# /// script -# dependencies = [ -# "data-designer", -# ] -# /// - -# ... data designer config_builder implementation - -def build_config() -> DataDesignerConfigBuilder: - """Implements the definition of the synthetic data generator. - """ - config_builder = DataDesignerConfigBuilder() - - ## Add whatever columns need to be added - # config_builder.add_column(...) - # config_builder.add_column(...) - # config_builder.add_column(...) - - return config_builder - -if __name__ == "__main__": - config_builder = build_config() - designer = DataDesigner() - preview = designer.preview(config_builder=config_builder) - - # The following command will print a random sample record - # which you can present to the user - preview.display_sample_record() - - # The raw data is located in this Pandas DataFrame object. - # You can implenent code to display some or all of this - # to STDOUT so you can see the outputs and report to the user. - preview.dataset -``` diff --git a/plans/427/agent-first-development-plan.md b/plans/427/agent-first-development-plan.md index 92c7e3a9d..a66e0ac21 100644 --- a/plans/427/agent-first-development-plan.md +++ b/plans/427/agent-first-development-plan.md @@ -52,7 +52,7 @@ This proposal draws strong inspiration from [NVIDIA/OpenShell](https://github.co ## Phase 0: AGENTS.md -AGENTS.md is injected into every agent prompt. It must land first because every subsequent phase references the architectural invariants it establishes. It should also be the most stable file in the repo — if it changes often, something is wrong. 
+AGENTS.md is injected into every agent prompt. It must be authored first within the PR because every subsequent deliverable references the architectural invariants it establishes. It should also be the most stable file in the repo — if it changes often, something is wrong. **Current state:** ~500 lines. Mixes project overview, architecture, code style, development workflow, and testing guidance into one file. @@ -404,37 +404,40 @@ Create `.github/workflows/issue-triage.yml`: Land this work as a sequence of incremental PRs rather than a single large rollout: -0. **Phase 0 PR** — `AGENTS.md` restructure (~50 lines). Lands first because it is injected into every agent prompt and every subsequent phase references it. -1. **Phase 1 PR(s)** — agent infrastructure consolidation and path cleanup. -2. **Phase 2 PR(s)** — remaining documentation (`STYLEGUIDE.md`, `DEVELOPMENT.md`, `CONTRIBUTING.md`, `README.md`, and optional `architecture/` skeleton). -3. **Phase 3 PR(s)** — GitHub machinery such as templates, labels, and skill output conformance. -4. **Phase 4** — do not start implementation directly from this plan. Treat it as follow-on work that requires another planning pass, design review, and then its own incremental PRs. - -The default PR boundary should be the phase boundary. If a phase is still too large, split it into a small sequence of focused PRs, but keep the phases ordered and avoid mixing deliverables from different phases in one PR. +1. **PR 1 (Phases 0 + 1 + 2)** — `AGENTS.md` restructure, agent infrastructure consolidation (`.agents/` / `.claude/` cleanup), and foundation documents (`STYLEGUIDE.md`, `DEVELOPMENT.md`, `CONTRIBUTING.md`, `README.md`, and optional `architecture/` skeleton). These phases are tightly coupled: the new `AGENTS.md` establishes vocabulary that the extracted docs and skill paths reference, and shipping them together avoids an intermediate state where `AGENTS.md` is slim but the extracted content doesn't exist yet. +2. 
**PR 2 (Phase 3)** — GitHub machinery such as templates, labels, and skill output conformance. +3. **Phase 4** — do not start implementation directly from this plan. Treat it as follow-on work that requires another planning pass, design review, and then its own incremental PRs. --- ## Execution Order +### PR 1 — Phases 0 + 1 + 2 + +All steps within PR 1 are implemented in the order listed. The suggested authoring sequence groups tightly coupled work together. + +| Step | Deliverable | Notes | +| ---- | ------------------------------------------------------------- | ------------------------------------------------------------------ | +| 1 | AGENTS.md restructure (~50 lines) | Establishes vocabulary; author first | +| 2 | STYLEGUIDE.md + DEVELOPMENT.md (extracted from old AGENTS.md) | Must exist before AGENTS.md ships so extracted content isn't lost | +| 3 | Skill consolidation (`.agents/`, `.claude/` cleanup) | Paths settle before docs reference them | +| 4 | CONTRIBUTING.md overhaul | References AGENTS.md, DEVELOPMENT.md, and skill paths | +| 5 | README.md updates | References CONTRIBUTING.md | +| 6 | `architecture/` skeleton | Independent; can be authored at any point | -| Step | Deliverable | Dependencies | Parallelizable | -| ---- | ---------------------------------------------------------- | -------------------------------------- | -------------------- | -| 0 | AGENTS.md restructure (~50 lines) | None | -- | -| 1 | Skill consolidation (`.agents/`, `.claude/` cleanup) | Step 0 (AGENTS.md settled) | -- | -| 2 | STYLEGUIDE.md + DEVELOPMENT.md (extracted from old AGENTS.md) | Step 0 | -- | -| 3 | CONTRIBUTING.md overhaul | Step 0 (references it) | -- | -| 4 | README.md updates | CONTRIBUTING.md (references it) | -- | -| 5 | Issue templates | CONTRIBUTING.md (templates link to it) | Yes (with 6-9) | -| 6 | PR template | None | Yes (with 5, 7-9) | -| 7 | CODEOWNERS update | None | Yes (with 5-6, 8-9) | -| 8 | Label creation (via `gh label create`) | None | Yes (with 
5-7, 9) | -| 9 | `architecture/` skeleton | None | Yes (with 5-8) | -| 10 | Skill template conformance updates | Issue/PR templates (steps 5-6) | -- | +### PR 2 — Phase 3 +Steps within PR 2 are largely independent and can be authored in parallel. -Step 0 lands first because AGENTS.md is injected into every prompt and establishes the architectural vocabulary. Steps 1-4 are sequential. Steps 5-9 are independent and can be parallelized. Step 10 depends on earlier steps. +| Step | Deliverable | Dependencies | Parallelizable | +| ---- | ---------------------------------- | -------------------------------------- | -------------------- | +| 7 | Issue templates | CONTRIBUTING.md (templates link to it) | Yes (with 8-11) | +| 8 | PR template | None | Yes (with 7, 9-11) | +| 9 | CODEOWNERS update | None | Yes (with 7-8, 10-11)| +| 10 | Label creation (via `gh label create`) | None | Yes (with 7-9, 11) | +| 11 | Skill template conformance updates | Issue/PR templates (steps 7-8) | -- | -In practice, Step 0 is Phase 0, Steps 1-4 map to Phase 1 and Phase 2 PRs, and Steps 5-10 map to one or more Phase 3 PRs. Phase 4 should be planned separately before any implementation PRs are opened. +Phase 4 should be planned separately before any implementation PRs are opened. --- From 51b33a2128cdb439cdf0320075cf6eed809dde07 Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Mon, 23 Mar 2026 18:09:39 -0600 Subject: [PATCH 3/7] docs: expand style guide and refine development docs Add docstring conventions (Google style), Pydantic/dataclass guidance, error handling patterns, and f-string preference to STYLEGUIDE.md. Clarify per-package test targets, flat test style, e2e API key requirement, notebook regeneration commands, and import perf threshold in DEVELOPMENT.md. Point dataset-building agents to the data-designer skill in AGENTS.md and clarify dependency direction arrows. 
Made-with: Cursor --- AGENTS.md | 6 +-- DEVELOPMENT.md | 37 ++++++++++++----- STYLEGUIDE.md | 109 ++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 137 insertions(+), 15 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 5d0152a65..e98d412f7 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,7 +1,7 @@ # AGENTS.md This file is for agents **developing** DataDesigner — the codebase you are working in. -If you are an agent helping a user **build a dataset**, see the [product documentation](https://nvidia-nemo.github.io/DataDesigner/) and tutorials instead. +If you are an agent helping a user **build a dataset**, use the [`data-designer` skill](skills/data-designer/SKILL.md) and the [product documentation](https://nvidia-nemo.github.io/DataDesigner/) instead. **DataDesigner** is an NVIDIA NeMo framework for creating synthetic datasets from scratch. Users declare what their data should look like (columns, types, relationships, validation rules); the engine figures out how to generate it. Every change you make should preserve this "declare, don't orchestrate" contract. @@ -15,7 +15,7 @@ The `data_designer` namespace is split across three installable packages that me | `data-designer-engine` | `packages/data-designer-engine/` | `data_designer.engine` — column generators, dataset builders, DAG execution, model facade, validators, sampling | | `data-designer` | `packages/data-designer/` | `data_designer.interface` — public `DataDesigner` class, results, errors; `data_designer.cli` — CLI entry point; `data_designer.integrations` | -**Dependency direction:** config ← engine ← interface. Never import against this flow. +**Dependency direction (left depends on right):** interface → engine → config. Never import against this flow. ## Core Concepts @@ -34,7 +34,7 @@ The `data_designer` namespace is split across three installable packages that me ## Structural Invariants -- **Import direction** — config ← engine ← interface. No reverse imports. 
+- **Import direction** — interface → engine → config (left depends on right). No reverse imports. - **Fast imports** — heavy third-party libraries are lazy-loaded via `data_designer.lazy_heavy_imports`. See [STYLEGUIDE.md](STYLEGUIDE.md) for the pattern. - **No relative imports** — absolute imports only, enforced by ruff rule `TID`. - **Typed code** — all functions, methods, and class attributes require type annotations. Modern syntax: `list[str]`, `str | None`. diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 86ea62031..802454006 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -97,25 +97,38 @@ uv run ruff format --check # Check formatting ### Running Tests +`make test` runs all three package test suites in sequence (config, engine, interface). When iterating on a single package, run its tests directly: + ```bash -# Run all tests +# Run all tests (config + engine + interface) make test -# Run tests with verbose output -uv run pytest -v +# Run a single package's tests +make test-config # data-designer-config +make test-engine # data-designer-engine +make test-interface # data-designer (interface) # Run a specific test file uv run pytest tests/config/test_sampler_constraints.py +# Run tests with verbose output +uv run pytest -v + # Run tests with coverage make coverage # View htmlcov/index.html in browser + +# E2E and example tests (slower, require API keys — see README.md for setup) +make test-e2e # end-to-end tests +make test-run-tutorials # run tutorial notebooks as tests +make test-run-recipes # run recipe scripts as tests ``` ### Test Patterns The project uses `pytest` with the following patterns: +- **Flat test functions**: Write standalone `test_*` functions, not `class`-based test suites. Use fixtures and parametrize for shared setup instead of class inheritance. 
- **Fixtures**: Shared fixtures are provided via `pytest_plugins` from `data_designer.config.testing.fixtures` and `data_designer.engine.testing.fixtures`, plus local `conftest.py` files in each test directory - **Stub configs**: YAML-based configuration stubs for testing (see `stub_data_designer_config_str` fixture) - **Mocking**: Use `unittest.mock.patch` for external services and dependencies @@ -173,13 +186,15 @@ Hooks include: ## Common Tasks ```bash -make clean # Clean up generated files -make update-license-headers # Add SPDX headers to new files -make check-all-fix # Format + lint before committing -make test # Run all tests -make coverage # Run tests with coverage report -make perf-import # Profile import time -make perf-import CLEAN=1 # Clean cache first, then profile +make clean # Clean up generated files +make update-license-headers # Add SPDX headers to new files +make check-all-fix # Format + lint before committing +make test # Run all tests +make coverage # Run tests with coverage report +make perf-import # Profile import time +make perf-import CLEAN=1 # Clean cache first, then profile +make convert-execute-notebooks # Regenerate .ipynb from docs/notebook_source/*.py +make generate-colab-notebooks # Generate Colab-compatible notebooks ``` --- @@ -192,4 +207,4 @@ After adding heavy third-party dependencies, verify import performance: make perf-import CLEAN=1 ``` -If import time regresses, add the dependency to `lazy_heavy_imports.py` — see [STYLEGUIDE.md](STYLEGUIDE.md) for the lazy loading pattern. +There is also a CI test (`test_import_performance` in `packages/data-designer/tests/test_import_perf.py`) that runs 5 import cycles (1 cold + 4 warm) and fails if the average exceeds **3 seconds**. If your dependency causes a regression, add it to `lazy_heavy_imports.py` — see [STYLEGUIDE.md](STYLEGUIDE.md) for the lazy loading pattern. 
diff --git a/STYLEGUIDE.md b/STYLEGUIDE.md index dae2807cb..036de36d1 100644 --- a/STYLEGUIDE.md +++ b/STYLEGUIDE.md @@ -12,6 +12,7 @@ For development workflow and testing, see [DEVELOPMENT.md](DEVELOPMENT.md). - **Line length**: Maximum 120 characters per line - **Quote style**: Always use double quotes (`"`) for strings - **Indentation**: Use 4 spaces (never tabs) +- **String formatting**: Prefer f-strings. Avoid `.format()` and `%` formatting. - **Target version**: Python 3.10+ ## License Headers @@ -19,7 +20,7 @@ For development workflow and testing, see [DEVELOPMENT.md](DEVELOPMENT.md). All Python files must include the NVIDIA SPDX license header: ```python -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 ``` @@ -33,6 +34,43 @@ Include `from __future__ import annotations` at the top of every Python source f Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments. +## Docstrings + +Use **Google style** docstrings (`Args:`, `Returns:`, `Raises:`). + +- **Public API classes and functions** get docstrings. Use a one-liner for simple functions; add Google sections for anything with non-obvious parameters or behavior. +- **Private helpers** (`_`-prefixed) don't need docstrings unless the logic is non-obvious. +- **Don't restate the signature** — the docstring should explain *why* or *what*, not repeat the parameter names and types that are already in the annotation. +- **Pydantic config classes** use `Attributes:` and `Inherited Attributes:` sections to document fields. +- **Module docstrings** are optional — use a one-liner after the license header when the module's purpose isn't obvious from its name. 
+ +```python +# Good - Google style with sections +def compile_config(config: DataDesignerConfig, provider: ResourceProvider) -> DataDesignerConfig: + """Compile a raw config into an executable form. + + Resolves seed columns, adds internal IDs, and validates the result. + + Args: + config: The user-provided configuration to compile. + provider: Resource provider for seed dataset resolution. + + Returns: + The compiled configuration ready for execution. + + Raises: + ConfigValidationError: If the configuration is invalid after compilation. + """ + +# Good - one-liner for simple functions +def get_column_names(config: DataDesignerConfig) -> list[str]: + """Return the names of all columns in the config.""" + +# Bad - restates the signature +def get_column_names(config: DataDesignerConfig) -> list[str]: + """Get column names from a DataDesignerConfig and return them as a list of strings.""" +``` + --- ## Type Annotations @@ -242,10 +280,58 @@ class dataset_generator: # Should be PascalCase - **Public before private**: Public functions/methods appear before private ones in modules and classes - **Class method order**: `__init__` and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all `@staticmethod`s in one block, all `@classmethod`s in one block). - **Prefer public over private for testability**: Use public functions (no `_` prefix) for helpers that benefit from direct testing +- **Avoid nested functions**: Define helpers at module level or as private methods on the class. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is closures that genuinely need to capture local state. - **Section comments in larger modules**: Use `# ---` separators to delineate logical groups (e.g. 
image parsing, usage extraction, generic accessors) --- +## Pydantic Models and Dataclasses + +**Pydantic** for config, validation, serialization, and schema generation. **Dataclasses** for simple data containers that don't need any of that. + +### Pydantic Models + +- Config models inherit `ConfigBase` (from `data_designer.config.base`), which sets shared defaults: `extra="forbid"`, `use_enum_values=True`, `arbitrary_types_allowed=True`. +- Use `Field()` when you need constraints (`ge`, `le`, `gt`), descriptions, `default_factory`, discriminators, or schema control (`exclude`, `SkipJsonSchema`). Use bare defaults for simple flags and strings. +- Specify validator `mode` explicitly (`mode="before"` or `mode="after"`). Name validators with descriptive verbs: `validate_*` for checks, `normalize_*` for canonicalization, `inject_*` for pre-parse dict shaping. + +```python +# Good - bare defaults for simple fields, Field() for constraints +class RunConfig(ConfigBase): + disable_early_shutdown: bool = False + shutdown_error_rate: float = Field(default=0.5, ge=0.0, le=1.0) + buffer_size: int = Field(default=1000, gt=0) + + @model_validator(mode="after") + def normalize_shutdown_settings(self) -> Self: + if self.disable_early_shutdown: + self.shutdown_error_rate = 1.0 + return self +``` + +### Dataclasses + +Use `@dataclass` for runtime data containers in the engine, CLI, and internal tooling — DTOs, concurrency primitives, task metadata. Prefer `frozen=True, slots=True` for immutable value types. 
+ +```python +@dataclass(frozen=True, slots=True) +class Usage: + prompt_tokens: int + completion_tokens: int + total_tokens: int +``` + +### When to Choose + +| Need | Use | +|------|-----| +| Validation, serialization, JSON schema | Pydantic (`ConfigBase` or `BaseModel`) | +| Typed struct with no validation | `@dataclass` | +| Immutable value object | `@dataclass(frozen=True, slots=True)` | +| Dict-shaped data (e.g., trace JSON) | `TypedDict` | + +--- + ## Design Principles **DRY** @@ -267,6 +353,27 @@ class dataset_generator: # Should be PascalCase --- +## Error Handling + +- Prefer specific exception types over bare `except`. Never catch `Exception` or `BaseException` without re-raising. +- Wrap third-party exceptions at module boundaries into canonical `data_designer` error types (see `data_designer.errors`, `data_designer.interface.errors`). +- Don't use exceptions for control flow — check conditions explicitly instead. +- Re-raise with context so the original traceback is preserved: + +```python +# Good +try: + response = client.chat(messages) +except httpx.HTTPStatusError as exc: + raise ModelClientError(f"LLM request failed: {exc.response.status_code}") from exc + +# Bad - swallows the original traceback +except httpx.HTTPStatusError as exc: + raise ModelClientError("LLM request failed") +``` + +--- + ## Common Pitfalls to Avoid 1. **Mutable default arguments**: From c415e15d95bfed4cd5bc7bd600d082be9159632e Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Mon, 23 Mar 2026 18:14:51 -0600 Subject: [PATCH 4/7] docs: link AGENTS.md to architecture/ directory Made-with: Cursor --- AGENTS.md | 1 + 1 file changed, 1 insertion(+) diff --git a/AGENTS.md b/AGENTS.md index e98d412f7..c15fcd7a2 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -53,3 +53,4 @@ make perf-import CLEAN=1 # profile import time (run after adding heavy deps) For full setup, testing, and workflow details see [DEVELOPMENT.md](DEVELOPMENT.md). 
For code style, naming, and import conventions see [STYLEGUIDE.md](STYLEGUIDE.md). +For deeper dives into specific subsystems see [`architecture/`](architecture/). From 2254bda92e5a1c952fb5e0460a6ca2556bef152d Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Mon, 23 Mar 2026 18:24:03 -0600 Subject: [PATCH 5/7] docs: refine CONTRIBUTING.md contribution workflow Add plan document step, self-review with multi-model passes, automated CI review expectations, and comment resolution protocol. Made-with: Cursor --- CONTRIBUTING.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 22515d78b..64638801f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -12,8 +12,8 @@ Agents accelerate work; humans stay accountable. People make design decisions an 1. **Open an issue** using the appropriate [issue template](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new/choose). 2. **Include investigation output.** If you used an agent, paste its diagnostics. If you didn't, include the troubleshooting you tried. -3. **For non-trivial changes, submit a plan** in the issue for review before building. Plans should describe the approach, trade-offs considered, and affected subsystems. -4. **Once approved, implement** using agent-assisted development. See [DEVELOPMENT.md](DEVELOPMENT.md) for local setup and workflow. +3. **For non-trivial changes, create a plan document** at `plans//` before building. Have your agent draft the plan — it should describe the approach, trade-offs considered, affected subsystems, and a delivery strategy. See existing plans in [`plans/`](plans/) for reference. Submit the plan in a PR for review. +4. **Once the plan is approved, implement** using agent-assisted development. See [DEVELOPMENT.md](DEVELOPMENT.md) for local setup and workflow. ## Before You Open an Issue @@ -57,12 +57,14 @@ The repository includes skills for common development tasks. 
These are located i make check-all-fix make test ``` +- Run a self-review before opening the PR using the `review-code` skill. Address any critical or warning-level findings before requesting human review. If you have access to multiple models, run the review with different models across passes — different models surface different issues, and a single pass rarely catches everything. ### Pull Request Review Process +- PRs receive an automated CI code review. You must address all critical and warning-level findings from the automated review before requesting human review. - Maintainers will review your PR and may request changes - Address feedback by pushing additional commits to your branch -- Reply to the feedback comment with a link to the commit that addresses it +- Reply to each comment before resolving it. If the comment resulted in a code change, include the commit hash that addresses it. Do not resolve comments without a response. - Once approved, a maintainer will merge your PR --- From 05a64f407ec54a65f61f36ce12e91fd26bd0dc90 Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Mon, 23 Mar 2026 18:28:16 -0600 Subject: [PATCH 6/7] docs: add architecture/ to PR 2 scope and link from AGENTS.md Move architecture doc population from deferred/incremental to PR 2 since the subsystems already exist. Update plan delivery strategy, execution order, and out-of-scope sections accordingly. Made-with: Cursor --- plans/427/agent-first-development-plan.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/plans/427/agent-first-development-plan.md b/plans/427/agent-first-development-plan.md index a66e0ac21..17c4ed056 100644 --- a/plans/427/agent-first-development-plan.md +++ b/plans/427/agent-first-development-plan.md @@ -405,7 +405,7 @@ Create `.github/workflows/issue-triage.yml`: Land this work as a sequence of incremental PRs rather than a single large rollout: 1. 
**PR 1 (Phases 0 + 1 + 2)** — `AGENTS.md` restructure, agent infrastructure consolidation (`.agents/` / `.claude/` cleanup), and foundation documents (`STYLEGUIDE.md`, `DEVELOPMENT.md`, `CONTRIBUTING.md`, `README.md`, and optional `architecture/` skeleton). These phases are tightly coupled: the new `AGENTS.md` establishes vocabulary that the extracted docs and skill paths reference, and shipping them together avoids an intermediate state where `AGENTS.md` is slim but the extracted content doesn't exist yet. -2. **PR 2 (Phase 3)** — GitHub machinery such as templates, labels, and skill output conformance. +2. **PR 2 (Phase 3 + architecture content)** — GitHub machinery (templates, labels, skill output conformance) and populating the `architecture/` stubs with content for existing subsystems. 3. **Phase 4** — do not start implementation directly from this plan. Treat it as follow-on work that requires another planning pass, design review, and then its own incremental PRs. --- @@ -425,17 +425,18 @@ All steps within PR 1 are implemented in the order listed. The suggested authori | 5 | README.md updates | References CONTRIBUTING.md | | 6 | `architecture/` skeleton | Independent; can be authored at any point | -### PR 2 — Phase 3 +### PR 2 — Phase 3 + Architecture Content Steps within PR 2 are largely independent and can be authored in parallel. 
| Step | Deliverable | Dependencies | Parallelizable | | ---- | ---------------------------------- | -------------------------------------- | -------------------- | -| 7 | Issue templates | CONTRIBUTING.md (templates link to it) | Yes (with 8-11) | -| 8 | PR template | None | Yes (with 7, 9-11) | -| 9 | CODEOWNERS update | None | Yes (with 7-8, 10-11)| -| 10 | Label creation (via `gh label create`) | None | Yes (with 7-9, 11) | -| 11 | Skill template conformance updates | Issue/PR templates (steps 7-8) | -- | +| 7 | Issue templates | CONTRIBUTING.md (templates link to it) | Yes (with 8-12) | +| 8 | PR template | None | Yes (with 7, 9-12) | +| 9 | CODEOWNERS update | None | Yes (with 7-8, 10-12)| +| 10 | Label creation (via `gh label create`) | None | Yes (with 7-9, 11-12)| +| 11 | Skill template conformance updates | Issue/PR templates (steps 7-8) | Yes (with 12) | +| 12 | Populate `architecture/` docs | Stubs from PR 1 | Yes (with 7-11) | Phase 4 should be planned separately before any implementation PRs are opened. @@ -447,7 +448,7 @@ Phase 4 should be planned separately before any implementation PRs are opened. - **LLM-powered issue triage** — deliberate choice to keep triage deterministic - **Vouch system** — defer until external contributor volume warrants it - **CI/CD changes** — existing workflows are solid -- **Full architecture docs** — create skeleton only, populate incrementally +- **Full architecture docs** — skeleton created in PR 1; content for existing subsystems populated in PR 2 - **Dependabot / Renovate** — dependency management automation (separate concern) ## Resolved Questions From eb5315bff6df90328454182e043410ef037b8521 Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Tue, 24 Mar 2026 10:38:31 -0600 Subject: [PATCH 7/7] docs: address PR review comments on style guide, dev guide, and contributing Replace pd.DataFrame with list[dict[str, str]] in naming example to avoid contradicting lazy-import guidance in the same file. 
Soften "enforced by SIM" to note SIM rules are not yet enabled in CI. Fix upstream sync instructions for fork-based contributors. Update copyright year in CONTRIBUTING.md from 2025 to 2026 to match STYLEGUIDE.md. Made-with: Cursor --- CONTRIBUTING.md | 2 +- DEVELOPMENT.md | 12 ++++++++++-- STYLEGUIDE.md | 4 ++-- 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 64638801f..de1bcc561 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -82,7 +82,7 @@ The repository includes skills for common development tasks. These are located i All code files must include the NVIDIA copyright header: ```python -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 ``` diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 802454006..09a33f9b5 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -55,9 +55,17 @@ Example: `nmulepati/feat/123-add-xyz-generator` ### Syncing with Upstream +If you're working from a fork, add the upstream remote first: + +```bash +git remote add upstream https://github.com/NVIDIA-NeMo/DataDesigner.git +``` + +Then sync: + ```bash -git fetch origin -git merge origin/main +git fetch upstream +git merge upstream/main ``` ### Validation Before Committing diff --git a/STYLEGUIDE.md b/STYLEGUIDE.md index 036de36d1..5a77d58f0 100644 --- a/STYLEGUIDE.md +++ b/STYLEGUIDE.md @@ -262,7 +262,7 @@ class DatasetGenerator: def __init__(self) -> None: self._cache: dict[str, str] = {} - def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame: + def generate_dataset(self, config: dict[str, str]) -> list[dict[str, str]]: pass # Bad @@ -409,7 +409,7 @@ except httpx.HTTPStatusError as exc: pass ``` -3. **Simplify code where possible** (enforced by `SIM`): +3. 
**Simplify code where possible** (`SIM` rules; not yet enforced by CI but code should comply): ```python # Bad
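# (The diff is truncated in this excerpt; a typical SIM103-style case this
# rule targets is sketched below - `is_enabled` is illustrative, not the
# repository's own example.)
def is_enabled(flag: str) -> bool:
    if flag == "on":  # needless if/else wrapped around a boolean result
        return True
    else:
        return False

# Good
def is_enabled(flag: str) -> bool:
    return flag == "on"
```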