Conversation
1e62c42 to
e1aa97b
Compare
4ba6373 to
6c91a06
Compare
6c91a06 to
8b0999a
Compare
dd07c95 to
909ff76
Compare
b2b95cd to
1913096
Compare
Greptile SummaryThis PR adds support for ingesting Claude Code and Codex agent rollout traces as seed datasets, normalizing JSONL-formatted session files into a structured Key changes and observations:
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/claude_code.py | New Claude Code format handler; parses JSONL session files and normalizes messages, tool calls, and reasoning. Handling for tool_result blocks in assistant content and sidechain detection is solid. Minor: normalize_content_block passes unknown block types (e.g. image) through as-is, which may produce non-text blobs in the messages list. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py | New Codex format handler; correctly models session_meta, event_msg reasoning, response_item messages/function calls. Latent logic issue: pending_reasoning is not cleared on non-assistant response_item messages, so accumulated reasoning can incorrectly bleed into a later assistant turn if the trace interleaves user messages between reasoning events and the next assistant turn. |
| packages/data-designer-config/src/data_designer/config/seed_source.py | Adds AgentRolloutSeedSource with optional path/file_pattern and format-aware defaults. Logic issue: _runtime_path set in validate_resolved_path_exists is overwritten to None by the inherited model_post_init (Pydantic v2 calls post_init after after-validators); path is resolved correctly but twice. Validation and lazy property fallback are both correct. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/utils.py | Shared utilities for JSONL loading, role normalization, and message building. load_jsonl_rows raises AgentRolloutSeedParseError on any bad line (intentional per PR discussion for 1-file=1-session semantics). All helpers are straightforward and well-typed. |
| packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py | Adds AgentRolloutSeedReader with correct _PARSE_CONTEXT_UNSET sentinel (fixing the previously-reviewed None-sentinel bug), lazy parse context caching, file-level error handling, and OSError re-raising as SeedReaderError. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/types.py | NormalizedAgentRolloutRecord dataclass with derived fields computed in post_init. Clean design; get_field_names() and to_dict() provide convenient serialization support. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/base.py | Clean ABC for format handlers with default no-op build_parse_context. Frozen AgentRolloutParseContext dataclass provides a sensible base for format-specific subclasses. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/registry.py | Simple dict-based registry built at import time with stateless handler instances. get_format_handler raises KeyError clearly on unknown formats. |
| docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py | Well-structured recipe demonstrating end-to-end trace ingestion, digest generation, SFT record creation, judge scoring, and partition-aware execution. build_arg_parser naming was previously fixed. No functional issues found. |
Sequence Diagram
sequenceDiagram
participant User
participant AgentRolloutSeedSource
participant AgentRolloutSeedReader
participant FormatRegistry
participant FormatHandler
participant JSONL as JSONL File
User->>AgentRolloutSeedSource: construct(format, path?)
AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_path (field_validator)
AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_resolved_path_exists (model_validator after)
AgentRolloutSeedSource->>AgentRolloutSeedSource: model_post_init (resets _runtime_path if path=None)
User->>AgentRolloutSeedReader: attach(source, resolver)
AgentRolloutSeedReader->>AgentRolloutSeedReader: _reset_attachment_state() → _parse_context = UNSET
User->>AgentRolloutSeedReader: build_manifest(context)
AgentRolloutSeedReader->>FormatRegistry: get_format_handler(format)
FormatRegistry-->>AgentRolloutSeedReader: handler
AgentRolloutSeedReader->>FormatHandler: is_handled_file(relative_path) for each matched file
FormatHandler-->>AgentRolloutSeedReader: list of handled paths
User->>AgentRolloutSeedReader: hydrate_row(manifest_row, context)
AgentRolloutSeedReader->>AgentRolloutSeedReader: _get_parse_context(context)
alt _parse_context is UNSET
AgentRolloutSeedReader->>FormatHandler: build_parse_context(root_path, recursive)
FormatHandler-->>AgentRolloutSeedReader: parse_context (cached)
end
AgentRolloutSeedReader->>FormatHandler: parse_file(root_path, relative_path, parse_context)
FormatHandler->>JSONL: load_jsonl_rows(file_path)
JSONL-->>FormatHandler: list of (line_no, dict) rows
FormatHandler->>FormatHandler: normalize messages / tool calls / reasoning
FormatHandler-->>AgentRolloutSeedReader: list[NormalizedAgentRolloutRecord]
AgentRolloutSeedReader-->>User: list[dict] (via .to_dict())
Comments Outside Diff (2)
-
packages/data-designer-config/src/data_designer/config/seed_source.py, line 137-143 (link)_runtime_pathcache set here is immediately discarded bymodel_post_initIn Pydantic v2,
model_post_initis called aftermodel_validator(mode='after'). BecauseAgentRolloutSeedSourceinheritsmodel_post_initfromFileSystemSeedSource, whenpath is Nonethe inherited implementation runs:self._runtime_path = None if self.path is None else _resolve_filesystem_runtime_path(self.path)
This resets
_runtime_pathtoNone, throwing away the value set on line 142. The code is functionally correct because theruntime_pathproperty lazily recomputes it, but everyAgentRolloutSeedSourceconstructed without an explicitpathresolves the path twice: once here (wasted) and once on the firstruntime_pathaccess.Consider either overriding
model_post_initinAgentRolloutSeedSourceto prevent the reset, or removing theself._runtime_path = …assignment from this validator since the property will compute it lazily anyway:@model_validator(mode="after") def validate_resolved_path_exists(self) -> Self: default_path, _ = get_agent_rollout_format_defaults(self.format) resolved_path = self.path or default_path _validate_filesystem_seed_source_path(resolved_path) # _runtime_path is set lazily by the runtime_path property; no need to set it here return self
-
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py, line 602-615 (link)pending_reasoningleaks into next assistant turn if it precedes a non-assistant messagepending_reasoningis only cleared when the current message item hasrole == "assistant"(line 608) or when afunction_callis encountered (line 618). If a Codex trace containsevent_msg(agent_reasoning)events that are immediately followed by a user message (e.g., a mid-turn user interjection or tool result delivered as a user item), those reasoning snippets survive intopending_reasoningand are later attached to the next assistant turn or function call — incorrectly attributing reasoning from a different context.For example, a trace like:
event_msg(agent_reasoning, "Thinking about A") → pending_reasoning = ["Thinking about A"] response_item(message, role="user", ...) → pending_reasoning still = ["Thinking about A"] ← not cleared response_item(message, role="assistant", ...) → reasoning_content = "Thinking about A" ← wrongConsider always clearing
pending_reasoningafter consuming it for any message type, or only accumulating it just before an assistant item is expected:
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 137-143
Comment:
**`_runtime_path` cache set here is immediately discarded by `model_post_init`**
In Pydantic v2, `model_post_init` is called *after* `model_validator(mode='after')`. Because `AgentRolloutSeedSource` inherits `model_post_init` from `FileSystemSeedSource`, when `path is None` the inherited implementation runs:
```python
self._runtime_path = None if self.path is None else _resolve_filesystem_runtime_path(self.path)
```
This resets `_runtime_path` to `None`, throwing away the value set on line 142. The code is functionally correct because the `runtime_path` property lazily recomputes it, but every `AgentRolloutSeedSource` constructed without an explicit `path` resolves the path twice: once here (wasted) and once on the first `runtime_path` access.
Consider either overriding `model_post_init` in `AgentRolloutSeedSource` to prevent the reset, or removing the `self._runtime_path = …` assignment from this validator since the property will compute it lazily anyway:
```python
@model_validator(mode="after")
def validate_resolved_path_exists(self) -> Self:
default_path, _ = get_agent_rollout_format_defaults(self.format)
resolved_path = self.path or default_path
_validate_filesystem_seed_source_path(resolved_path)
# _runtime_path is set lazily by the runtime_path property; no need to set it here
return self
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py
Line: 602-615
Comment:
**`pending_reasoning` leaks into next assistant turn if it precedes a non-assistant message**
`pending_reasoning` is only cleared when the current message item has `role == "assistant"` (line 608) or when a `function_call` is encountered (line 618). If a Codex trace contains `event_msg(agent_reasoning)` events that are immediately followed by a **user message** (e.g., a mid-turn user interjection or tool result delivered as a user item), those reasoning snippets survive into `pending_reasoning` and are later attached to the *next* assistant turn or function call — incorrectly attributing reasoning from a different context.
For example, a trace like:
```
event_msg(agent_reasoning, "Thinking about A") → pending_reasoning = ["Thinking about A"]
response_item(message, role="user", ...) → pending_reasoning still = ["Thinking about A"] ← not cleared
response_item(message, role="assistant", ...) → reasoning_content = "Thinking about A" ← wrong
```
Consider always clearing `pending_reasoning` after consuming it for any message type, or only accumulating it just before an assistant item is expected:
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "Merge branch 'main' ..."
...ges/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py
Outdated
Show resolved
Hide resolved
091ecff to
7d56607
Compare
packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
Show resolved
Hide resolved
Add support for ingesting Claude Code and Codex agent rollout traces as seed datasets. Traces are parsed from JSONL files into a normalized message format suitable for distillation pipelines. Architecture: - engine/resources/agent_rollout/ flat package with per-format handler modules - AgentRolloutFormatHandler ABC with build_parse_context() for format-specific setup (e.g. Claude session index loading) - NormalizedAgentRolloutRecord dataclass with __post_init__ derived fields - Dict-based format handler registry keyed by AgentRolloutFormat enum - Built-in filesystem seed readers (directory, file contents) for general use Includes AgentRolloutSeedSource config, end-to-end tests, handler unit tests, and documentation recipes for rollout-based SFT curation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f992392 to
5a67544
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore files that were inadvertently modified by the squash commit: person_reader.py, seed_readers.py, managed_dataset_generator.py, resource_provider.py, samplers.py, and related tests. Remove managed_storage.py and managed_dataset_repository.py additions that belong to a separate feature branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…emony - Delete test_utils.py (52 lines) — all functions tested transitively by parse_file tests - Merge 4 Claude happy-path tests into 1 comprehensive test (10 → 4 tests) - Merge 4 Codex happy-path tests into 1 comprehensive test (6 → 3 tests) - Remove normalize_content_block/coerce_raw_blocks tests (internal helpers) - Delete test_create_dataset_skips_empty_and_malformed_trace_files (redundant with unit tests) - De-parametrize unhandled-files warning test to Claude-only - Shrink _write_claude_trace_directory and _write_codex_trace_directory fixtures - Delete unused helpers: _write_invalid_jsonl, _with_skipped_files, _with_unhandled_files (codex) Net: -315 lines across 4 files, 22 → 12 rollout-specific tests retained. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the hand-rolled AGENT_ROLLOUT_OUTPUT_COLUMNS list with NormalizedAgentRolloutRecord.get_field_names(), which introspects the dataclass fields. Eliminates the risk of column list drifting from the record definition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix parse context cache sentinel bug: use _PARSE_CONTEXT_UNSET object to distinguish "not yet computed" from "computed and returned None", preventing repeated build_parse_context calls for None-returning formats - Rename parse_args() to build_arg_parser() in recipe script to match its return type (ArgumentParser, not parsed Namespace) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove phantom "chat-completion" format from recipe card docs - Replace dataclasses.asdict deep-copy with shallow field access in to_dict - Add missing source_kind and message_count assertions in Codex tests - Fix recursive test to actually discriminate recursive vs non-recursive scanning - Fix Jinja2 falsy score check (`if score` → `if score is not none`) in recipe - Fix `self.path or default` → `self.path if self.path is not None else default` - Remove redundant field validators inherited from FileSystemSeedSource Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
engine/resources/agent_rollout/flat package with per-format handler modules,build_parse_context()for format-specific setup, and dict-based registryAgentRolloutSeedSourceconfig withAgentRolloutFormatenum (claude_code,codex), default paths, and file pattern resolutionUsage
Claude Code traces (uses default
~/.claude/projectspath)Codex traces with explicit path
Distillation recipe (CLI)
uv run docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py \ --format claude_code --num-records 20 --preview uv run docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py \ --format codex --trace-dir /path/to/sessions --num-records 50Architecture
Test plan
test_claude_code.py)test_codex.py)AgentRolloutSeedReaderintest_seed_reader.pytest_data_designer.py🤖 Generated with Claude Code