feat(examples): export rollouts to JSON (s) #68
Merged
`examples/eval-tui` is a modern, reactive Textual dashboard over `agentix.runner`: a per-instance grid (pending -> setup -> agent -> scoring -> PASS/FAIL/skip/error), a live summary bar (done / resolved / failed / running + throughput), and an event log. In-flight phases are observed by wrapping the dataset/agent adapters (`_adapters.py`), so `agentix.runner` is unchanged.

- `--demo N` runs a synthetic, no-Docker batch (reproducible from a seed), so it can be tried instantly. Real runs resolve `module:attr` dataset/agent + a provider, exactly like `agentix-run`.
- Standalone example (own lock): its TUI deps stay out of the core-dev venv.
- Verified headlessly: ruff clean + a Textual `run_test()` pilot test that drives the demo to completion (no Docker).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Restructure the single rollout dashboard into a multi-tab Textual app (AgentixTUI) that surfaces each Agentix area:

- Rollouts: the live dashboard, refactored into a reusable view widget.
- Catalog: installed `agentix*` distributions + `agentix.provider` and `agentix.nix` entry points (pure introspection, no Docker).
- Sandboxes / Build / Observability: signposted placeholders for the follow-up PRs that flesh them out.

Adds DESIGN.md (the rubrics this iterates against), an idle state so the app is useful with no run attached, and pilot tests for the tabbed app, the idle path, and catalog discovery. ruff + headless run_test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l pane

Highlight a row in the Rollouts grid to see that instance's full detail (verdict, duration, agent exit, patch size, score breakdown, error) in a side panel, alongside the live event log. The rendered detail text is also exposed on the view for headless assertions. ruff + 4 pilot tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A branded landing tab: a warm-gradient "AGENTIX" banner, live ecosystem stat cards (packages / providers / nix-closures from the same introspection the Catalog uses), a Docker-readiness indicator, and quick hints. Registers a branded Textual theme (best-effort; falls back to the default if the running Textual version's theme API differs). Pure introspection — renders with or without Docker. Adds an Aesthetics rubric (DESIGN.md) and a pilot test. ruff + 5 pilot tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the Sandboxes placeholder with a live readiness view: the known backends (docker / podman / apptainer / daytona / e2b) each probed for usability here — binary on PATH, daemon reachable (a real `<bin> info` subprocess in a worker), or SDK + API key present — plus a short note on the session + remote-invoke model. Degrades gracefully when nothing is installed. ruff + 6 pilot tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
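The readiness checks described above (binary on PATH, daemon reachable via `<bin> info`, or SDK + API key present) might look roughly like the following. This is a minimal sketch under assumptions: the function names, the three status strings, and the timeout are hypothetical, and the SDK module and environment variable names are left as parameters because the commit message does not specify them.

```python
import importlib.util
import os
import shutil
import subprocess


def probe_cli(binary: str, timeout: float = 5.0) -> str:
    """Classify a CLI backend as 'missing', 'installed', or 'ready' (hypothetical labels)."""
    path = shutil.which(binary)
    if path is None:
        return "missing"  # binary is not on PATH
    try:
        # A real `<bin> info` call: exit code 0 is taken to mean the daemon answered.
        done = subprocess.run(
            [path, "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "installed"  # binary exists but could not be queried in time
    return "ready" if done.returncode == 0 else "installed"


def probe_sdk(module: str, env_var: str) -> str:
    """Classify a hosted backend by importable SDK plus a non-empty API key variable."""
    if importlib.util.find_spec(module) is None:
        return "missing"
    return "ready" if os.environ.get(env_var) else "installed"
```

Since `subprocess.run` blocks, a TUI would call `probe_cli` from a worker rather than the UI thread, which matches the "in a worker" note above. With nothing installed every probe returns `"missing"`, so the view still renders.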
Replaces the Observability placeholder with a split live feed of the two Agentix side channels: /trace (OTel-style spans) on the left, /log (bridged stdlib logging) on the right. With no run attached it plays a short synthetic demo so the shape is visible; real streams arrive from running sandboxes. ruff + 7 pilot tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the Build placeholder with an interactive planner: a project-path input that live-constructs the `agentix build … --platform … --output …` command, the build model (uv owns Python, Nix owns binaries), and the `agentix.nix` closures that would be staged (real entry-point introspection). Adds number keybindings (1–6) to jump between tabs. The control room now has six live tabs — Overview · Rollouts · Catalog · Sandboxes · Build · Observability. ruff + 9 pilot tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Catalog tab gets a filter input that narrows the distributions/entry-points table by name / kind / detail as you type (title shows matched/total). ruff + 10 pilot tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
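The filter behaviour can be illustrated as a case-insensitive substring match across each row's fields, returning the `matched/total` string for the title. This is a sketch only: `filter_rows` and the plain-tuple row shape `(name, kind, detail)` are assumptions, not the PR's implementation.

```python
def filter_rows(rows, query):
    """Return (matching rows, 'matched/total' title); hypothetical helper.

    Each row is assumed to be a tuple of strings such as (name, kind, detail).
    """
    needle = query.strip().casefold()
    if not needle:
        matched = list(rows)  # an empty filter shows everything
    else:
        matched = [
            row for row in rows
            if any(needle in field.casefold() for field in row)
        ]
    return matched, f"{len(matched)}/{len(rows)}"
```

Recomputing the match on every keystroke is cheap at catalog scale, which is what makes the "as you type" narrowing practical without any indexing.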
Adds `RolloutsView.export_payload()` (a JSON-friendly snapshot: each `Rollout.to_dict()` plus an aggregate) and `export_to(path)`. `s` writes `agentix-rollouts.json` to the cwd and toasts the count; with no results yet it warns instead. This is the unit an RL/eval loop persists for offline analysis or replay. ruff + 12 pilot tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Lets you persist a run straight from the dashboard.
`RolloutsView.export_payload()` returns a JSON-friendly snapshot: each `Rollout.to_dict()` (instance id, resolved, patch size, agent exit, score, duration) plus a small aggregate (total/done/resolved/failed). Pressing `s` writes `agentix-rollouts.json` to the cwd and toasts the count; with nothing collected yet it warns instead of writing an empty file.
This serves Agentix's rollout data collection goal directly: the snapshot is the unit an RL/eval loop persists for offline analysis or replay, and it's built only on the runner's existing `Rollout.to_dict()`.
Verification
`ruff check`; headless `run_test` pilots assert the payload snapshots all instances and that `export_to(tmp_path)` round-trips through `json.loads`. 12 pilot tests green.
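For readers who want the shape of the export and its round-trip check, here is a minimal, self-contained sketch. It does not import Agentix: the `Rollout` dataclass is a hypothetical stand-in whose field names are guessed from the list above, the payload keys (`aggregate`, `rollouts`) are assumptions, and `failed` is simplified to "done but not resolved", which ignores the skip/error states the real dashboard tracks.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Rollout:
    """Hypothetical stand-in for the runner's Rollout; field names are assumed."""
    instance_id: str
    resolved: bool
    patch_size: int
    agent_exit: int
    score: float
    duration: float

    def to_dict(self) -> dict:
        return asdict(self)


def export_payload(rollouts, total):
    """Build a JSON-friendly snapshot: per-rollout dicts plus a small aggregate."""
    done = len(rollouts)
    resolved = sum(1 for r in rollouts if r.resolved)
    return {
        "aggregate": {
            "total": total,
            "done": done,
            "resolved": resolved,
            "failed": done - resolved,  # simplification, see lead-in
        },
        "rollouts": [r.to_dict() for r in rollouts],
    }


def export_to(path, rollouts, total):
    """Write the snapshot to `path` and return how many rollouts were written."""
    payload = export_payload(rollouts, total)
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return len(payload["rollouts"])
```

A round-trip check in the spirit of the pilot test would call `export_to` on a temporary path and compare `json.loads` of the written file against `export_payload(...)`.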