feat: agent rollout trace ingestion by eric-tramel · Pull Request #399 · NVIDIA-NeMo/DataDesigner

eric-tramel · 2026-03-11T19:30:17Z

Summary

Add support for ingesting Claude Code and Codex agent rollout traces as seed datasets, parsed from JSONL files into a normalized message format for distillation pipelines
Structured engine/resources/agent_rollout/ flat package with per-format handler modules, build_parse_context() for format-specific setup, and dict-based registry
AgentRolloutSeedSource config with AgentRolloutFormat enum (claude_code, codex), default paths, and file pattern resolution
Documentation recipes for rollout-based SFT curation

Usage

Claude Code traces (uses default `~/.claude/projects` path)

import data_designer.config as dd
from data_designer.interface import DataDesigner

config = dd.DataDesignerConfigBuilder()
config.with_seed_dataset(
    dd.AgentRolloutSeedSource(format=dd.AgentRolloutFormat.CLAUDE_CODE)
)
config.add_column(
    dd.ExpressionColumnConfig(name="trace_summary", expr="{{ final_assistant_message[:200] }}")
)

data_designer = DataDesigner()
results = data_designer.create(config, num_records=10, dataset_name="claude-traces")

Codex traces with explicit path

config = dd.DataDesignerConfigBuilder()
config.with_seed_dataset(
    dd.AgentRolloutSeedSource(
        path="/path/to/codex/sessions",
        format=dd.AgentRolloutFormat.CODEX,
    )
)

Distillation recipe (CLI)

uv run docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py \
    --format claude_code --num-records 20 --preview

uv run docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py \
    --format codex --trace-dir /path/to/sessions --num-records 50

Architecture

engine/resources/agent_rollout/
├── __init__.py       # Public API (5 re-exports)
├── base.py           # AgentRolloutParseContext + AgentRolloutFormatHandler ABC
├── types.py          # NormalizedAgentRolloutRecord (derived fields via __post_init__)
├── utils.py          # Shared helpers (build_message, load_jsonl_rows, etc.)
├── registry.py       # Dict-based format handler registry
├── claude_code.py    # Claude Code handler + format-specific normalization
└── codex.py          # Codex handler + format-specific normalization

Test plan

Claude Code handler tests with tmp_path JSONL fixtures (test_claude_code.py)
Codex handler tests with tmp_path JSONL fixtures (test_codex.py)
Integration tests via AgentRolloutSeedReader in test_seed_reader.py
End-to-end tests in test_data_designer.py
All tests passing

🤖 Generated with Claude Code

greptile-apps · 2026-03-18T13:24:25Z

Greptile Summary

This PR adds support for ingesting Claude Code and Codex agent rollout traces as seed datasets, normalizing JSONL-formatted session files into a structured NormalizedAgentRolloutRecord format for use in distillation pipelines. The implementation introduces a clean flat package (engine/resources/agent_rollout/) with an ABC-based handler pattern, a dict-based registry, and a new AgentRolloutSeedSource config class that integrates naturally with the existing FileSystemSeedSource hierarchy.

Key changes and observations:

The _PARSE_CONTEXT_UNSET sentinel pattern in AgentRolloutSeedReader correctly addresses the previously-flagged None-cache bug from the earlier review round.
The validate_resolved_path_exists model validator in AgentRolloutSeedSource correctly validates the effective path (user-supplied or format default) at construction time, but the _runtime_path it caches is silently reset to None by the inherited model_post_init (which runs after model validators in Pydantic v2). The runtime_path property's lazy fallback means this is not a functional bug, but the validator's cache write is wasted.
In the Codex handler (codex.py), pending_reasoning is only cleared when an assistant message or function_call item is consumed. Reasoning events that appear before a non-assistant message would persist and be incorrectly attributed to the next assistant turn — a latent logic issue if real Codex traces interleave user messages between reasoning and assistant responses.
The recipe (agent_rollout_distillation.py) is well-structured with good argument handling, partition support, and clear separation of config-building from execution.

Confidence Score: 4/5

Safe to merge with minor logic issues to be aware of; both findings are non-breaking in current trace formats.
The implementation is well-structured and all previously-flagged issues have been addressed. Two new issues remain: (1) a harmless _runtime_path double-resolution due to Pydantic v2 execution order (wasted work, not incorrect), and (2) a latent pending_reasoning attribution bug in the Codex handler that only manifests if real Codex traces contain reasoning events immediately before non-assistant messages — an ordering that likely doesn't occur in practice. Test coverage is comprehensive, and the public API integrates cleanly with the existing seed reader pattern.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py (pending_reasoning clearing logic), packages/data-designer-config/src/data_designer/config/seed_source.py (validate_resolved_path_exists cache interaction with model_post_init)

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/claude_code.py	New Claude Code format handler; parses JSONL session files and normalizes messages, tool calls, and reasoning. Handling for tool_result blocks in assistant content and sidechain detection is solid. Minor: `normalize_content_block` passes unknown block types (e.g. image) through as-is, which may produce non-text blobs in the messages list.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py	New Codex format handler; correctly models session_meta, event_msg reasoning, response_item messages/function calls. Latent logic issue: pending_reasoning is not cleared on non-assistant response_item messages, so accumulated reasoning can incorrectly bleed into a later assistant turn if the trace interleaves user messages between reasoning events and the next assistant turn.
packages/data-designer-config/src/data_designer/config/seed_source.py	Adds AgentRolloutSeedSource with optional path/file_pattern and format-aware defaults. Logic issue: _runtime_path set in validate_resolved_path_exists is overwritten to None by the inherited model_post_init (Pydantic v2 calls post_init after after-validators); path is resolved correctly but twice. Validation and lazy property fallback are both correct.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/utils.py	Shared utilities for JSONL loading, role normalization, and message building. load_jsonl_rows raises AgentRolloutSeedParseError on any bad line (intentional per PR discussion for 1-file=1-session semantics). All helpers are straightforward and well-typed.
packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py	Adds AgentRolloutSeedReader with correct _PARSE_CONTEXT_UNSET sentinel (fixing the previously-reviewed None-sentinel bug), lazy parse context caching, file-level error handling, and OSError re-raising as SeedReaderError.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/types.py	NormalizedAgentRolloutRecord dataclass with derived fields computed in post_init. Clean design; get_field_names() and to_dict() provide convenient serialization support.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/base.py	Clean ABC for format handlers with default no-op build_parse_context. Frozen AgentRolloutParseContext dataclass provides a sensible base for format-specific subclasses.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/registry.py	Simple dict-based registry built at import time with stateless handler instances. get_format_handler raises KeyError clearly on unknown formats.
docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py	Well-structured recipe demonstrating end-to-end trace ingestion, digest generation, SFT record creation, judge scoring, and partition-aware execution. build_arg_parser naming was previously fixed. No functional issues found.

Sequence Diagram

sequenceDiagram
    participant User
    participant AgentRolloutSeedSource
    participant AgentRolloutSeedReader
    participant FormatRegistry
    participant FormatHandler
    participant JSONL as JSONL File

    User->>AgentRolloutSeedSource: construct(format, path?)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_path (field_validator)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_resolved_path_exists (model_validator after)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: model_post_init (resets _runtime_path if path=None)

    User->>AgentRolloutSeedReader: attach(source, resolver)
    AgentRolloutSeedReader->>AgentRolloutSeedReader: _reset_attachment_state() → _parse_context = UNSET

    User->>AgentRolloutSeedReader: build_manifest(context)
    AgentRolloutSeedReader->>FormatRegistry: get_format_handler(format)
    FormatRegistry-->>AgentRolloutSeedReader: handler
    AgentRolloutSeedReader->>FormatHandler: is_handled_file(relative_path) for each matched file
    FormatHandler-->>AgentRolloutSeedReader: list of handled paths

    User->>AgentRolloutSeedReader: hydrate_row(manifest_row, context)
    AgentRolloutSeedReader->>AgentRolloutSeedReader: _get_parse_context(context)
    alt _parse_context is UNSET
        AgentRolloutSeedReader->>FormatHandler: build_parse_context(root_path, recursive)
        FormatHandler-->>AgentRolloutSeedReader: parse_context (cached)
    end
    AgentRolloutSeedReader->>FormatHandler: parse_file(root_path, relative_path, parse_context)
    FormatHandler->>JSONL: load_jsonl_rows(file_path)
    JSONL-->>FormatHandler: list of (line_no, dict) rows
    FormatHandler->>FormatHandler: normalize messages / tool calls / reasoning
    FormatHandler-->>AgentRolloutSeedReader: list[NormalizedAgentRolloutRecord]
    AgentRolloutSeedReader-->>User: list[dict] (via .to_dict())

Comments Outside Diff (2)

packages/data-designer-config/src/data_designer/config/seed_source.py, line 137-143 (link)

_runtime_path cache set here is immediately discarded by model_post_init

In Pydantic v2, model_post_init is called after model_validator(mode='after'). Because AgentRolloutSeedSource inherits model_post_init from FileSystemSeedSource, when path is None the inherited implementation runs:
```
self._runtime_path = None if self.path is None else _resolve_filesystem_runtime_path(self.path)
```
This resets _runtime_path to None, throwing away the value set on line 142. The code is functionally correct because the runtime_path property lazily recomputes it, but every AgentRolloutSeedSource constructed without an explicit path resolves the path twice: once here (wasted) and once on the first runtime_path access.

Consider either overriding model_post_init in AgentRolloutSeedSource to prevent the reset, or removing the self._runtime_path = … assignment from this validator since the property will compute it lazily anyway:
```
@model_validator(mode="after")
def validate_resolved_path_exists(self) -> Self:
    default_path, _ = get_agent_rollout_format_defaults(self.format)
    resolved_path = self.path or default_path
    _validate_filesystem_seed_source_path(resolved_path)
    # _runtime_path is set lazily by the runtime_path property; no need to set it here
    return self
```
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py, line 602-615 (link)

pending_reasoning leaks into next assistant turn if it precedes a non-assistant message

pending_reasoning is only cleared when the current message item has role == "assistant" (line 608) or when a function_call is encountered (line 618). If a Codex trace contains event_msg(agent_reasoning) events that are immediately followed by a user message (e.g., a mid-turn user interjection or tool result delivered as a user item), those reasoning snippets survive into pending_reasoning and are later attached to the next assistant turn or function call — incorrectly attributing reasoning from a different context.

For example, a trace like:
```
event_msg(agent_reasoning, "Thinking about A")  →  pending_reasoning = ["Thinking about A"]
response_item(message, role="user", ...)        →  pending_reasoning still = ["Thinking about A"]  ← not cleared
response_item(message, role="assistant", ...)   →  reasoning_content = "Thinking about A"  ← wrong
```
Consider always clearing pending_reasoning after consuming it for any message type, or only accumulating it just before an assistant item is expected:

Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 137-143

Comment:
**`_runtime_path` cache set here is immediately discarded by `model_post_init`**

In Pydantic v2, `model_post_init` is called *after* `model_validator(mode='after')`. Because `AgentRolloutSeedSource` inherits `model_post_init` from `FileSystemSeedSource`, when `path is None` the inherited implementation runs:

```python
self._runtime_path = None if self.path is None else _resolve_filesystem_runtime_path(self.path)
```

This resets `_runtime_path` to `None`, throwing away the value set on line 142. The code is functionally correct because the `runtime_path` property lazily recomputes it, but every `AgentRolloutSeedSource` constructed without an explicit `path` resolves the path twice: once here (wasted) and once on the first `runtime_path` access.

Consider either overriding `model_post_init` in `AgentRolloutSeedSource` to prevent the reset, or removing the `self._runtime_path = …` assignment from this validator since the property will compute it lazily anyway:

```python
@model_validator(mode="after")
def validate_resolved_path_exists(self) -> Self:
    default_path, _ = get_agent_rollout_format_defaults(self.format)
    resolved_path = self.path or default_path
    _validate_filesystem_seed_source_path(resolved_path)
    # _runtime_path is set lazily by the runtime_path property; no need to set it here
    return self
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/codex.py
Line: 602-615

Comment:
**`pending_reasoning` leaks into next assistant turn if it precedes a non-assistant message**

`pending_reasoning` is only cleared when the current message item has `role == "assistant"` (line 608) or when a `function_call` is encountered (line 618). If a Codex trace contains `event_msg(agent_reasoning)` events that are immediately followed by a **user message** (e.g., a mid-turn user interjection or tool result delivered as a user item), those reasoning snippets survive into `pending_reasoning` and are later attached to the *next* assistant turn or function call — incorrectly attributing reasoning from a different context.

For example, a trace like:
```
event_msg(agent_reasoning, "Thinking about A")  →  pending_reasoning = ["Thinking about A"]
response_item(message, role="user", ...)        →  pending_reasoning still = ["Thinking about A"]  ← not cleared
response_item(message, role="assistant", ...)   →  reasoning_content = "Thinking about A"  ← wrong
```

Consider always clearing `pending_reasoning` after consuming it for any message type, or only accumulating it just before an assistant item is expected:

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: "Merge branch 'main' ..."}

packages/data-designer-config/tests/config/test_seed_source.py

...ges/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py

docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py

packages/data-designer-config/src/data_designer/config/seed_source.py

packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

Add support for ingesting Claude Code and Codex agent rollout traces as seed datasets. Traces are parsed from JSONL files into a normalized message format suitable for distillation pipelines. Architecture: - engine/resources/agent_rollout/ flat package with per-format handler modules - AgentRolloutFormatHandler ABC with build_parse_context() for format-specific setup (e.g. Claude session index loading) - NormalizedAgentRolloutRecord dataclass with __post_init__ derived fields - Dict-based format handler registry keyed by AgentRolloutFormat enum - Built-in filesystem seed readers (directory, file contents) for general use Includes AgentRolloutSeedSource config, end-to-end tests, handler unit tests, and documentation recipes for rollout-based SFT curation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restore files that were inadvertently modified by the squash commit: person_reader.py, seed_readers.py, managed_dataset_generator.py, resource_provider.py, samplers.py, and related tests. Remove managed_storage.py and managed_dataset_repository.py additions that belong to a separate feature branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…emony - Delete test_utils.py (52 lines) — all functions tested transitively by parse_file tests - Merge 4 Claude happy-path tests into 1 comprehensive test (10 → 4 tests) - Merge 4 Codex happy-path tests into 1 comprehensive test (6 → 3 tests) - Remove normalize_content_block/coerce_raw_blocks tests (internal helpers) - Delete test_create_dataset_skips_empty_and_malformed_trace_files (redundant with unit tests) - De-parametrize unhandled-files warning test to Claude-only - Shrink _write_claude_trace_directory and _write_codex_trace_directory fixtures - Delete unused helpers: _write_invalid_jsonl, _with_skipped_files, _with_unhandled_files (codex) Net: -315 lines across 4 files, 22 → 12 rollout-specific tests retained. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the hand-rolled AGENT_ROLLOUT_OUTPUT_COLUMNS list with NormalizedAgentRolloutRecord.get_field_names(), which introspects the dataclass fields. Eliminates the risk of column list drifting from the record definition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix parse context cache sentinel bug: use _PARSE_CONTEXT_UNSET object to distinguish "not yet computed" from "computed and returned None", preventing repeated build_parse_context calls for None-returning formats - Rename parse_args() to build_arg_parser() in recipe script to match its return type (ArgumentParser, not parsed Namespace) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove phantom "chat-completion" format from recipe card docs - Replace dataclasses.asdict deep-copy with shallow field access in to_dict - Add missing source_kind and message_count assertions in Codex tests - Fix recursive test to actually discriminate recursive vs non-recursive scanning - Fix Jinja2 falsy score check (`if score` → `if score is not none`) in recipe - Fix `self.path or default` → `self.path if self.path is not None else default` - Remove redundant field validators inherited from FileSystemSeedSource Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eric-tramel mentioned this pull request Mar 11, 2026

feat: directory seed transforms for agent trace ingestion #390

Closed

eric-tramel force-pushed the feature/directory-seed-transforms-v1 branch from 1e62c42 to e1aa97b Compare March 12, 2026 14:01

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 4ba6373 to 6c91a06 Compare March 12, 2026 14:22

eric-tramel self-assigned this Mar 13, 2026

eric-tramel added enhancement New feature or request labels Mar 13, 2026

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 6c91a06 to 8b0999a Compare March 16, 2026 16:00

eric-tramel changed the title ~~feat: add built-in trace directory normalizers~~ feat: add built-in trace seed sources Mar 16, 2026

eric-tramel changed the base branch from feature/directory-seed-transforms-v1 to feature/filesystem-seed-readers-v1 March 16, 2026 16:00

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from dd07c95 to 909ff76 Compare March 16, 2026 17:08

andreatgretel mentioned this pull request Mar 16, 2026

feat: add built-in filesystem seed readers #421

Merged

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from b2b95cd to 1913096 Compare March 18, 2026 02:10

eric-tramel changed the base branch from feature/filesystem-seed-readers-v1 to main March 18, 2026 12:31

eric-tramel changed the title ~~feat: add built-in trace seed sources~~ feat: add AgentRollout seed source and formats Mar 18, 2026

eric-tramel marked this pull request as ready for review March 18, 2026 13:18

eric-tramel requested a review from a team as a code owner March 18, 2026 13:18

greptile-apps bot reviewed Mar 18, 2026

View reviewed changes

packages/data-designer-config/tests/config/test_seed_source.py Show resolved Hide resolved

...ges/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py Outdated Show resolved Hide resolved

eric-tramel changed the title ~~feat: add AgentRollout seed source and formats~~ feat: add AgentRollout seed source with lazy manifest/hydrate architecture Mar 19, 2026

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 091ecff to 7d56607 Compare March 19, 2026 13:26

greptile-apps bot reviewed Mar 19, 2026

View reviewed changes

docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py Show resolved Hide resolved

greptile-apps bot reviewed Mar 19, 2026

View reviewed changes

packages/data-designer-config/src/data_designer/config/seed_source.py Show resolved Hide resolved

greptile-apps bot reviewed Mar 19, 2026

View reviewed changes

packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py Show resolved Hide resolved

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from f992392 to 5a67544 Compare March 19, 2026 16:42

eric-tramel changed the title ~~feat: add AgentRollout seed source with lazy manifest/hydrate architecture~~ feat: add agent rollout trace ingestion with structured parsing package Mar 19, 2026

eric-tramel changed the title ~~feat: add agent rollout trace ingestion with structured parsing package~~ feat: agent rollout trace ingestion Mar 19, 2026

chore: update license headers to 2026 in agent_rollout package

98be91e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eric-tramel marked this pull request as draft March 19, 2026 16:56

eric-tramel and others added 6 commits March 19, 2026 13:06

chore: remove seed-datasets.md changes from rollout PR

12f90eb

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' into feature/trace-directory-normalizers-v1

82a98f3

eric-tramel marked this pull request as ready for review March 19, 2026 17:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: agent rollout trace ingestion#399

feat: agent rollout trace ingestion#399
eric-tramel wants to merge 9 commits intomainfrom
feature/trace-directory-normalizers-v1

eric-tramel commented Mar 11, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 18, 2026 •

edited

Loading

Confidence Score: 4/5

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eric-tramel commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Usage

Claude Code traces (uses default ~/.claude/projects path)

Codex traces with explicit path

Distillation recipe (CLI)

Architecture

Test plan

Uh oh!

greptile-apps bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

eric-tramel commented Mar 11, 2026 •

edited

Loading

Claude Code traces (uses default `~/.claude/projects` path)

greptile-apps bot commented Mar 18, 2026 •

edited

Loading