feat: agent rollout trace ingestion #399

Open
eric-tramel wants to merge 9 commits into main from feature/trace-directory-normalizers-v1

Conversation

@eric-tramel
Contributor

@eric-tramel eric-tramel commented Mar 11, 2026

Summary

  • Add support for ingesting Claude Code and Codex agent rollout traces as seed datasets, parsed from JSONL files into a normalized message format for distillation pipelines
  • Structured engine/resources/agent_rollout/ flat package with per-format handler modules, build_parse_context() for format-specific setup, and dict-based registry
  • AgentRolloutSeedSource config with AgentRolloutFormat enum (claude_code, codex), default paths, and file pattern resolution
  • Documentation recipes for rollout-based SFT curation

Usage

Claude Code traces (uses default ~/.claude/projects path)

import data_designer.config as dd
from data_designer.interface import DataDesigner

config = dd.DataDesignerConfigBuilder()
config.with_seed_dataset(
    dd.AgentRolloutSeedSource(format=dd.AgentRolloutFormat.CLAUDE_CODE)
)
config.add_column(
    dd.ExpressionColumnConfig(name="trace_summary", expr="{{ final_assistant_message[:200] }}")
)

data_designer = DataDesigner()
results = data_designer.create(config, num_records=10, dataset_name="claude-traces")

Codex traces with explicit path

config = dd.DataDesignerConfigBuilder()
config.with_seed_dataset(
    dd.AgentRolloutSeedSource(
        path="/path/to/codex/sessions",
        format=dd.AgentRolloutFormat.CODEX,
    )
)

Distillation recipe (CLI)

uv run docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py \
    --format claude_code --num-records 20 --preview

uv run docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py \
    --format codex --trace-dir /path/to/sessions --num-records 50

Architecture

engine/resources/agent_rollout/
├── __init__.py       # Public API (5 re-exports)
├── base.py           # AgentRolloutParseContext + AgentRolloutFormatHandler ABC
├── types.py          # NormalizedAgentRolloutRecord (derived fields via __post_init__)
├── utils.py          # Shared helpers (build_message, load_jsonl_rows, etc.)
├── registry.py       # Dict-based format handler registry
├── claude_code.py    # Claude Code handler + format-specific normalization
└── codex.py          # Codex handler + format-specific normalization
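
As a rough illustration of how the pieces in this layout might fit together, here is a minimal sketch of an ABC-based handler with a default no-op build_parse_context() and a dict-based registry. All class and function names below (RolloutFormat, ParseContext, FormatHandler, and so on) are simplified stand-ins chosen for this example, not the actual definitions in the package.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class RolloutFormat(Enum):
    # Hypothetical stand-in for AgentRolloutFormat.
    CLAUDE_CODE = "claude_code"
    CODEX = "codex"


@dataclass(frozen=True)
class ParseContext:
    # Hypothetical stand-in for AgentRolloutParseContext.
    root_path: Path


class FormatHandler(ABC):
    """Assumed shape of a per-format handler."""

    def build_parse_context(self, root_path: Path) -> ParseContext | None:
        # Default no-op: formats without shared setup return None.
        return None

    @abstractmethod
    def is_handled_file(self, relative_path: str) -> bool:
        """Return True if this handler should parse the given file."""


class ClaudeCodeHandler(FormatHandler):
    def build_parse_context(self, root_path: Path) -> ParseContext | None:
        # Assumption: this format needs some per-directory setup.
        return ParseContext(root_path=root_path)

    def is_handled_file(self, relative_path: str) -> bool:
        return relative_path.endswith(".jsonl")


class CodexHandler(FormatHandler):
    def is_handled_file(self, relative_path: str) -> bool:
        return relative_path.endswith(".jsonl")


# Registry built once at import time with stateless handler instances.
_REGISTRY: dict[RolloutFormat, FormatHandler] = {
    RolloutFormat.CLAUDE_CODE: ClaudeCodeHandler(),
    RolloutFormat.CODEX: CodexHandler(),
}


def get_format_handler(fmt: RolloutFormat) -> FormatHandler:
    # A plain dict lookup raises KeyError on unknown formats.
    return _REGISTRY[fmt]
```

Adding a format in this sketch means writing one handler class and adding one registry entry; nothing else needs to know about it.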

Test plan

  • Claude Code handler tests with tmp_path JSONL fixtures (test_claude_code.py)
  • Codex handler tests with tmp_path JSONL fixtures (test_codex.py)
  • Integration tests via AgentRolloutSeedReader in test_seed_reader.py
  • End-to-end tests in test_data_designer.py
  • All tests passing

🤖 Generated with Claude Code

@eric-tramel eric-tramel force-pushed the feature/directory-seed-transforms-v1 branch from 1e62c42 to e1aa97b March 12, 2026 14:01
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 4ba6373 to 6c91a06 March 12, 2026 14:22
@eric-tramel eric-tramel self-assigned this Mar 13, 2026
@eric-tramel eric-tramel added the enhancement label (New feature or request) Mar 13, 2026
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 6c91a06 to 8b0999a March 16, 2026 16:00
@eric-tramel eric-tramel changed the title from "feat: add built-in trace directory normalizers" to "feat: add built-in trace seed sources" Mar 16, 2026
@eric-tramel eric-tramel changed the base branch from feature/directory-seed-transforms-v1 to feature/filesystem-seed-readers-v1 March 16, 2026 16:00
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from dd07c95 to 909ff76 March 16, 2026 17:08
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from b2b95cd to 1913096 March 18, 2026 02:10
@eric-tramel eric-tramel changed the base branch from feature/filesystem-seed-readers-v1 to main March 18, 2026 12:31
@eric-tramel eric-tramel changed the title from "feat: add built-in trace seed sources" to "feat: add AgentRollout seed source and formats" Mar 18, 2026
@eric-tramel eric-tramel marked this pull request as ready for review March 18, 2026 13:18
@eric-tramel eric-tramel requested a review from a team as a code owner March 18, 2026 13:18
@greptile-apps
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR adds support for ingesting Claude Code and Codex agent rollout traces as seed datasets, normalizing JSONL-formatted session files into a structured NormalizedAgentRolloutRecord format for use in distillation pipelines. The implementation introduces a clean flat package (engine/resources/agent_rollout/) with an ABC-based handler pattern, a dict-based registry, and a new AgentRolloutSeedSource config class that integrates naturally with the existing FileSystemSeedSource hierarchy.

Key changes and observations:

  • The _PARSE_CONTEXT_UNSET sentinel pattern in AgentRolloutSeedReader correctly addresses the previously-flagged None-cache bug from the earlier review round.
  • The validate_resolved_path_exists model validator in AgentRolloutSeedSource correctly validates the effective path (user-supplied or format default) at construction time, but the _runtime_path it caches is silently reset to None by the inherited model_post_init (which runs after model validators in Pydantic v2). The runtime_path property's lazy fallback means this is not a functional bug, but the validator's cache write is wasted.
  • In the Codex handler (codex.py), pending_reasoning is only cleared when an assistant message or function_call item is consumed. Reasoning events that appear before a non-assistant message would persist and be incorrectly attributed to the next assistant turn — a latent logic issue if real Codex traces interleave user messages between reasoning and assistant responses.
  • The recipe (agent_rollout_distillation.py) is well-structured with good argument handling, partition support, and clear separation of config-building from execution.

Confidence Score: 4/5

  • Safe to merge with minor logic issues to be aware of; both findings are non-breaking in current trace formats.
  • The implementation is well-structured and all previously-flagged issues have been addressed. Two new issues remain: (1) a harmless _runtime_path double-resolution due to Pydantic v2 execution order (wasted work, not incorrect), and (2) a latent pending_reasoning attribution bug in the Codex handler that only manifests if real Codex traces contain reasoning events immediately before non-assistant messages — an ordering that likely doesn't occur in practice. Test coverage is comprehensive, and the public API integrates cleanly with the existing seed reader pattern.
  • packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py (pending_reasoning clearing logic), packages/data-designer-config/src/data_designer/config/seed_source.py (validate_resolved_path_exists cache interaction with model_post_init)

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/claude_code.py | New Claude Code format handler; parses JSONL session files and normalizes messages, tool calls, and reasoning. Handling for tool_result blocks in assistant content and sidechain detection is solid. Minor: normalize_content_block passes unknown block types (e.g. image) through as-is, which may produce non-text blobs in the messages list. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py | New Codex format handler; correctly models session_meta, event_msg reasoning, response_item messages/function calls. Latent logic issue: pending_reasoning is not cleared on non-assistant response_item messages, so accumulated reasoning can incorrectly bleed into a later assistant turn if the trace interleaves user messages between reasoning events and the next assistant turn. |
| packages/data-designer-config/src/data_designer/config/seed_source.py | Adds AgentRolloutSeedSource with optional path/file_pattern and format-aware defaults. Logic issue: _runtime_path set in validate_resolved_path_exists is overwritten to None by the inherited model_post_init (Pydantic v2 calls post_init after after-validators); path is resolved correctly but twice. Validation and lazy property fallback are both correct. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/utils.py | Shared utilities for JSONL loading, role normalization, and message building. load_jsonl_rows raises AgentRolloutSeedParseError on any bad line (intentional per PR discussion for 1-file=1-session semantics). All helpers are straightforward and well-typed. |
| packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py | Adds AgentRolloutSeedReader with correct _PARSE_CONTEXT_UNSET sentinel (fixing the previously-reviewed None-sentinel bug), lazy parse context caching, file-level error handling, and OSError re-raising as SeedReaderError. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/types.py | NormalizedAgentRolloutRecord dataclass with derived fields computed in post_init. Clean design; get_field_names() and to_dict() provide convenient serialization support. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/base.py | Clean ABC for format handlers with default no-op build_parse_context. Frozen AgentRolloutParseContext dataclass provides a sensible base for format-specific subclasses. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/registry.py | Simple dict-based registry built at import time with stateless handler instances. get_format_handler raises KeyError clearly on unknown formats. |
| docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py | Well-structured recipe demonstrating end-to-end trace ingestion, digest generation, SFT record creation, judge scoring, and partition-aware execution. build_arg_parser naming was previously fixed. No functional issues found. |
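
To make the "any bad line fails the whole file" behavior described for load_jsonl_rows concrete, here is a minimal sketch of a strict JSONL loader that returns (line_no, dict) rows. It is not the implementation in utils.py: RolloutParseError is a hypothetical stand-in for AgentRolloutSeedParseError, and skipping blank lines and rejecting non-object rows are assumptions made for this example.

```python
from __future__ import annotations

import json
from pathlib import Path


class RolloutParseError(ValueError):
    """Hypothetical stand-in for AgentRolloutSeedParseError."""


def load_jsonl_rows(file_path: Path) -> list[tuple[int, dict]]:
    """Load every JSON object in a JSONL file, failing on the first bad line."""
    rows: list[tuple[int, dict]] = []
    with file_path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # Assumption: blank lines are ignored.
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                # One file is one session, so a corrupt line invalidates the file.
                raise RolloutParseError(f"{file_path}:{line_no}: invalid JSON") from exc
            if not isinstance(row, dict):
                raise RolloutParseError(f"{file_path}:{line_no}: expected a JSON object")
            rows.append((line_no, row))
    return rows
```

Keeping the 1-based line number next to each row lets later normalization errors point back at the exact line in the trace file.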

Sequence Diagram

sequenceDiagram
    participant User
    participant AgentRolloutSeedSource
    participant AgentRolloutSeedReader
    participant FormatRegistry
    participant FormatHandler
    participant JSONL as JSONL File

    User->>AgentRolloutSeedSource: construct(format, path?)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_path (field_validator)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_resolved_path_exists (model_validator after)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: model_post_init (resets _runtime_path if path=None)

    User->>AgentRolloutSeedReader: attach(source, resolver)
    AgentRolloutSeedReader->>AgentRolloutSeedReader: _reset_attachment_state() → _parse_context = UNSET

    User->>AgentRolloutSeedReader: build_manifest(context)
    AgentRolloutSeedReader->>FormatRegistry: get_format_handler(format)
    FormatRegistry-->>AgentRolloutSeedReader: handler
    AgentRolloutSeedReader->>FormatHandler: is_handled_file(relative_path) for each matched file
    FormatHandler-->>AgentRolloutSeedReader: list of handled paths

    User->>AgentRolloutSeedReader: hydrate_row(manifest_row, context)
    AgentRolloutSeedReader->>AgentRolloutSeedReader: _get_parse_context(context)
    alt _parse_context is UNSET
        AgentRolloutSeedReader->>FormatHandler: build_parse_context(root_path, recursive)
        FormatHandler-->>AgentRolloutSeedReader: parse_context (cached)
    end
    AgentRolloutSeedReader->>FormatHandler: parse_file(root_path, relative_path, parse_context)
    FormatHandler->>JSONL: load_jsonl_rows(file_path)
    JSONL-->>FormatHandler: list of (line_no, dict) rows
    FormatHandler->>FormatHandler: normalize messages / tool calls / reasoning
    FormatHandler-->>AgentRolloutSeedReader: list[NormalizedAgentRolloutRecord]
    AgentRolloutSeedReader-->>User: list[dict] (via .to_dict())

Comments Outside Diff (2)

  1. packages/data-designer-config/src/data_designer/config/seed_source.py, line 137-143 (link)

    _runtime_path cache set here is immediately discarded by model_post_init

    In Pydantic v2, model_post_init is called after model_validator(mode='after'). Because AgentRolloutSeedSource inherits model_post_init from FileSystemSeedSource, when path is None the inherited implementation runs:

    self._runtime_path = None if self.path is None else _resolve_filesystem_runtime_path(self.path)

    This resets _runtime_path to None, throwing away the value set on line 142. The code is functionally correct because the runtime_path property lazily recomputes it, but every AgentRolloutSeedSource constructed without an explicit path resolves the path twice: once here (wasted) and once on the first runtime_path access.

    Consider either overriding model_post_init in AgentRolloutSeedSource to prevent the reset, or removing the self._runtime_path = … assignment from this validator since the property will compute it lazily anyway:

    @model_validator(mode="after")
    def validate_resolved_path_exists(self) -> Self:
        default_path, _ = get_agent_rollout_format_defaults(self.format)
        resolved_path = self.path or default_path
        _validate_filesystem_seed_source_path(resolved_path)
        # _runtime_path is set lazily by the runtime_path property; no need to set it here
        return self
  2. packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py, line 602-615 (link)

    pending_reasoning leaks into next assistant turn if it precedes a non-assistant message

    pending_reasoning is only cleared when the current message item has role == "assistant" (line 608) or when a function_call is encountered (line 618). If a Codex trace contains event_msg(agent_reasoning) events that are immediately followed by a user message (e.g., a mid-turn user interjection or tool result delivered as a user item), those reasoning snippets survive into pending_reasoning and are later attached to the next assistant turn or function call — incorrectly attributing reasoning from a different context.

    For example, a trace like:

    event_msg(agent_reasoning, "Thinking about A")  →  pending_reasoning = ["Thinking about A"]
    response_item(message, role="user", ...)        →  pending_reasoning still = ["Thinking about A"]  ← not cleared
    response_item(message, role="assistant", ...)   →  reasoning_content = "Thinking about A"  ← wrong
    

    Consider always clearing pending_reasoning after consuming it for any message type, or only accumulating it just before an assistant item is expected:
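
    One way the first option could look, as a minimal sketch only. The event dictionaries and the attach_reasoning function are simplified, hypothetical stand-ins and do not mirror the real Codex trace schema or the code in codex.py; the point is that the buffer is reset after every message, whatever its role.

```python
from __future__ import annotations


def attach_reasoning(events: list[dict]) -> list[dict]:
    """Attach buffered reasoning only to the assistant message that directly follows it."""
    messages: list[dict] = []
    pending_reasoning: list[str] = []
    for event in events:
        if event["type"] == "reasoning":
            pending_reasoning.append(event["text"])
            continue
        message = {"role": event["role"], "content": event["text"]}
        if event["role"] == "assistant" and pending_reasoning:
            message["reasoning_content"] = "\n".join(pending_reasoning)
        # Clear for every message type so reasoning that preceded a
        # non-assistant message cannot leak into a later assistant turn.
        pending_reasoning = []
        messages.append(message)
    return messages
```

    With the three-event trace above, the assistant message in this sketch ends up with no reasoning_content, because the buffered text is dropped when the user message is processed.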


Last reviewed commit: "Merge branch 'main' ..."

@eric-tramel eric-tramel changed the title from "feat: add AgentRollout seed source and formats" to "feat: add AgentRollout seed source with lazy manifest/hydrate architecture" Mar 19, 2026
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 091ecff to 7d56607 March 19, 2026 13:26
Add support for ingesting Claude Code and Codex agent rollout traces as
seed datasets. Traces are parsed from JSONL files into a normalized message
format suitable for distillation pipelines.

Architecture:
- engine/resources/agent_rollout/ flat package with per-format handler modules
- AgentRolloutFormatHandler ABC with build_parse_context() for format-specific
  setup (e.g. Claude session index loading)
- NormalizedAgentRolloutRecord dataclass with __post_init__ derived fields
- Dict-based format handler registry keyed by AgentRolloutFormat enum
- Built-in filesystem seed readers (directory, file contents) for general use

Includes AgentRolloutSeedSource config, end-to-end tests, handler unit tests,
and documentation recipes for rollout-based SFT curation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from f992392 to 5a67544 March 19, 2026 16:42
@eric-tramel eric-tramel changed the title from "feat: add AgentRollout seed source with lazy manifest/hydrate architecture" to "feat: add agent rollout trace ingestion with structured parsing package" Mar 19, 2026
@eric-tramel eric-tramel changed the title from "feat: add agent rollout trace ingestion with structured parsing package" to "feat: agent rollout trace ingestion" Mar 19, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eric-tramel eric-tramel marked this pull request as draft March 19, 2026 16:56
eric-tramel and others added 6 commits March 19, 2026 13:06
Restore files that were inadvertently modified by the squash commit:
person_reader.py, seed_readers.py, managed_dataset_generator.py,
resource_provider.py, samplers.py, and related tests. Remove
managed_storage.py and managed_dataset_repository.py additions that
belong to a separate feature branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…emony

- Delete test_utils.py (52 lines) — all functions tested transitively by parse_file tests
- Merge 4 Claude happy-path tests into 1 comprehensive test (10 → 4 tests)
- Merge 4 Codex happy-path tests into 1 comprehensive test (6 → 3 tests)
- Remove normalize_content_block/coerce_raw_blocks tests (internal helpers)
- Delete test_create_dataset_skips_empty_and_malformed_trace_files (redundant with unit tests)
- De-parametrize unhandled-files warning test to Claude-only
- Shrink _write_claude_trace_directory and _write_codex_trace_directory fixtures
- Delete unused helpers: _write_invalid_jsonl, _with_skipped_files, _with_unhandled_files (codex)

Net: -315 lines across 4 files, 22 → 12 rollout-specific tests retained.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the hand-rolled AGENT_ROLLOUT_OUTPUT_COLUMNS list with
NormalizedAgentRolloutRecord.get_field_names(), which introspects the
dataclass fields. Eliminates the risk of column list drifting from the
record definition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
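
A minimal sketch of the idea in the commit above, assuming a much smaller record than the real one: the column list is derived from the dataclass itself via dataclasses.fields, so it cannot drift from the record definition. RolloutRecord and trace_id are hypothetical; message_count and final_assistant_message are used here only as examples of fields derived in __post_init__.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields


@dataclass
class RolloutRecord:
    # Hypothetical, reduced stand-in for NormalizedAgentRolloutRecord.
    trace_id: str
    messages: list[dict]
    # Derived fields: excluded from __init__ and filled in by __post_init__.
    message_count: int = field(init=False)
    final_assistant_message: str | None = field(init=False)

    def __post_init__(self) -> None:
        self.message_count = len(self.messages)
        self.final_assistant_message = next(
            (m["content"] for m in reversed(self.messages) if m["role"] == "assistant"),
            None,
        )

    @classmethod
    def get_field_names(cls) -> list[str]:
        # Introspect the dataclass instead of maintaining a parallel list.
        return [f.name for f in fields(cls)]

    def to_dict(self) -> dict:
        # Shallow field access; nested values are shared, not deep-copied.
        return {name: getattr(self, name) for name in self.get_field_names()}
```

In this sketch, adding a field to the dataclass automatically adds it to get_field_names() and to_dict(), with no second list to update.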
- Fix parse context cache sentinel bug: use _PARSE_CONTEXT_UNSET object
  to distinguish "not yet computed" from "computed and returned None",
  preventing repeated build_parse_context calls for None-returning formats
- Rename parse_args() to build_arg_parser() in recipe script to match
  its return type (ArgumentParser, not parsed Namespace)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
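
The sentinel fix described above can be sketched as follows. This is an illustration under assumptions, not the reader's actual code: CachingReader and its constructor argument are hypothetical, and only the use of a dedicated sentinel object in place of None follows the commit message.

```python
from __future__ import annotations

from typing import Any, Callable

# A unique object that no builder can ever return, unlike None.
_PARSE_CONTEXT_UNSET = object()


class CachingReader:
    """Hypothetical reader that lazily builds and caches a parse context."""

    def __init__(self, build_parse_context: Callable[[], Any]) -> None:
        self._build_parse_context = build_parse_context
        self._parse_context: Any = _PARSE_CONTEXT_UNSET

    def get_parse_context(self) -> Any:
        # Identity check against the sentinel: a cached None counts as "computed".
        if self._parse_context is _PARSE_CONTEXT_UNSET:
            self._parse_context = self._build_parse_context()
        return self._parse_context
```

If the cache were initialized to None and checked with `is None`, a format whose builder legitimately returns None would trigger the builder on every call.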
@eric-tramel eric-tramel marked this pull request as ready for review March 19, 2026 17:49
- Remove phantom "chat-completion" format from recipe card docs
- Replace dataclasses.asdict deep-copy with shallow field access in to_dict
- Add missing source_kind and message_count assertions in Codex tests
- Fix recursive test to actually discriminate recursive vs non-recursive scanning
- Fix Jinja2 falsy score check (`if score` → `if score is not none`) in recipe
- Fix `self.path or default` → `self.path if self.path is not None else default`
- Remove redundant field validators inherited from FileSystemSeedSource

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
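
The two falsy-value fixes in the list above share one cause: truthiness checks treat valid values such as 0 or an empty string as missing. The plain-Python sketch below illustrates the difference; resolve_path and format_score are hypothetical helpers written for this example, and the Jinja2 test `is not none` plays the same role in a template as `is not None` does here.

```python
from __future__ import annotations


def resolve_path(path: str | None, default: str) -> str:
    # `path or default` would replace an explicit empty string with the default.
    return path if path is not None else default


def format_score(score: int | None) -> str:
    # `if score:` would report a legitimate score of 0 as unscored.
    return "unscored" if score is None else f"score={score}"
```

Only a genuinely absent value (None) falls back to the default in this version; every explicit value, including falsy ones, is preserved.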
Labels

enhancement New feature or request

1 participant