TraceForge is a Jac-native batch failure compiler for coding-agent trajectories.
It ingests mini-SWE-agent `*.traj.json` files, compiles them into a graph, groups recurring failure motifs, localizes likely critical steps, and synthesizes reusable memory updates such as AGENTS.md patches.
TraceForge is now being shaped primarily as a CLI-first tool for Codex CLI and Claude Code workflows. The core job is not to replace those coding agents, but to give them better-structured evidence than a raw trajectory dump.
Coding-agent trajectories are long, repetitive, and difficult to compare at batch scale. TraceForge is meant to turn those runs into something judges and users can inspect quickly:
- failure families,
- representative clusters,
- critical-step evidence,
- and reusable operational memory rules.
The most important product loop is:
- generate a `raw` evidence pack
- generate a `structured` TraceForge evidence pack
- let Codex or Claude Code analyze the difference
The thesis is:
same failed run, same outer model, better evidence pack
If another coding agent is to operate this repo from the GitHub URL alone, point it at:
- the repo URL: https://github.com/Dhravidk/TraceForge
- AGENTS.md
- quickstart.md
- repo_handoff_prompts.md
The intended fresh-clone path is:

```
git clone https://github.com/Dhravidk/TraceForge.git
cd TraceForge
./scripts/bootstrap
source .venv/bin/activate
```

Then run TraceForge from the repo root:
```
traceforge doctor
traceforge analyze-batch --batch sample-starter
traceforge run --batch sample-starter --run premature_completion
traceforge pack --batch sample-starter --run premature_completion --mode raw
traceforge pack --batch sample-starter --run premature_completion --mode structured
traceforge compare --batch sample-starter --run premature_completion --strict-provider
```

For automation-friendly output, add `--json`.
Provider resolution follows one rule across the CLI:
- explicit `--provider` wins
- otherwise a saved preference from `traceforge auth use` wins
- otherwise logged-in Codex is preferred
- otherwise API-key-backed OpenAI or Anthropic is used if configured
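The resolution order can be sketched as a small function. The function name, parameters, and defaults below are illustrative assumptions, not TraceForge's actual internals:

```python
# Illustrative sketch of the documented provider-resolution precedence.
# All names here are hypothetical, not TraceForge's real code.
def resolve_provider(cli_provider=None, saved_preference=None,
                     codex_logged_in=False, api_keys=()):
    """Return the provider to use, applying the documented precedence."""
    if cli_provider:                      # explicit --provider wins
        return cli_provider
    if saved_preference:                  # saved via `traceforge auth use`
        return saved_preference
    if codex_logged_in:                   # logged-in Codex is preferred
        return "codex"
    for candidate in ("openai", "anthropic"):
        if candidate in api_keys:         # API-key-backed fallback
            return candidate
    return None                           # no usable provider configured
```

The point of a single precedence function is that every subcommand resolves providers the same way, which is what "one rule across the CLI" implies.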
To save a pack artifact for downstream use:
```
traceforge pack --batch sample-starter --run premature_completion --mode structured --save
```

To save a compare artifact for downstream use:

```
traceforge compare --batch sample-starter --run premature_completion --save
```

To generate the whole sample demo bundle in one command:

```
traceforge demo --batch sample-starter --run premature_completion
```

To analyze your own trajectories instead of the sample batch:

```
traceforge analyze-batch --input /path/to/my_batch
traceforge overview --batch upload-my_batch
traceforge run --batch upload-my_batch --run my_run_id
traceforge pack --batch upload-my_batch --run my_run_id --mode structured
```

The detailed terminal-first guides are:
- AGENTS.md
- quickstart.md
- provider_setup.md
- agent_workflows.md
- command_reference.md
- demo_playbook.md
- output_schema.md
- repo_handoff_prompts.md
- troubleshooting.md
- validation_notes.md
This repo now supports a CLI-first Jac demo path for sample and local upload batches.
- Jac-native parsing for starter fixtures and richer mini-SWE-agent `*.traj.json` fields, including `info.exit_status`, `info.model_stats`, `trajectory_format`, tool-call turns, and role-`tool` observations
- Jac-native deterministic fingerprints and failure-family scoring
- graph compilation into `Batch`, `Run`, `Step`, artifact, hypothesis, and cluster nodes
- graph-backed batch, run, cluster, diagnosis, patch, comparison, and report walkers
- discovered batch catalog with switching between sample and local upload batches
- local upload support for both folders and zip archives containing `*.traj.json`
- external folder uploads now get managed aliases under `uploads/` so they can be analyzed through the normal batch flow
- credential-gated typed `by llm()` reasoning with deterministic fallback when no model key is present
- Jac smoke tests for the starter demo path
- a public `traceforge` CLI wrapper for doctor, run, pack, compare, and export flows
- pack-first analysis for raw versus structured evidence on the same failed run
- fair same-schema raw-transcript-vs-TraceForge comparison with explicit blind spots, support points, verifier output, and evidence-window grounding
- blinded evaluation export for side-by-side judging of raw transcript analysis versus TraceForge retrieval
- gold annotation template export for a manually labeled evaluation subset
- rigorous batch evaluation export with provider-aware Anthropic/OpenAI API support and gold-score uplift summaries
- markdown batch report export that doubles as a demo and Devpost backup artifact
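To make the fingerprint idea concrete: a deterministic failure fingerprint can be derived by normalizing a run's step actions and hashing the result, so recurring failure motifs collapse to the same key. This is a minimal Python sketch under assumed field names (`action`, plus an exit-status string), not the repo's actual Jac implementation:

```python
import hashlib
import re

def failure_fingerprint(steps, exit_status):
    """Hash a normalized view of a run so identical failure shapes
    map to the same fingerprint (illustrative sketch only)."""
    normalized = []
    for step in steps:
        # Strip volatile details (numbers, ids) so recurring motifs match.
        normalized.append(re.sub(r"\d+", "N", step["action"]))
    payload = "|".join(normalized) + "|" + exit_status
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Determinism is the key property: the same failure shape always yields the same fingerprint, which makes batch-scale grouping reproducible without any model in the loop.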
The remaining major work is deeper typed `by llm()` synthesis, final CLI polish, and last-mile demo recording polish. The current repo already supports the judge-facing CLI path: run inspection, raw-versus-structured pack generation, provider-aware comparison, blinded evaluation export, gold annotation export, rigorous uplift scoring, and markdown report export.
Two things are true at once:
- the checked-in `sample-starter` batch is intentionally tiny and exists mainly for smoke tests and deterministic demo rehearsal
- the tool has also been exercised on a real external 100-trajectory mini-SWE-agent batch, which is summarized in validation_notes.md
That external validation is enough to show that TraceForge can ingest and summarize a materially larger batch, surface real infrastructure and budget failures, and localize representative runs. It is not yet the same thing as shipping a fully inspectable benchmark artifact set in-repo, and the docs should be read with that distinction in mind.
```
.
├── LONG_TERM_PLAN.md
├── README.md
├── jac.toml
├── main.jac
├── docs/
│   ├── plans/
│   └── submission/
├── traceforge/
│   ├── __init__.jac
│   ├── schema.jac
│   ├── ingest.jac
│   ├── parser.jac
│   ├── features.jac
│   ├── clustering.jac
│   ├── critical.jac
│   ├── graph_build.jac
│   ├── analysis.jac
│   ├── llm_ops.jac
│   ├── eval.jac
│   ├── reporting.jac
│   ├── api.jac
│   └── ui.jac
├── demo_runs/
├── uploads/
├── exports/
└── tests/
```
Jac is central to the design:
- graph-native schema for runs, steps, files, patches, tests, and clusters
- walkers as the public API surface and orchestration layer
- typed `by llm()` outputs for diagnoses and memory patches
- a CLI-first operator path with an optional appendix UI
This is important for JacHacks because meaningful Jac usage is part of the judging criteria.
- Run `traceforge doctor`.
- Analyze the sample batch.
- Open one failed run in the terminal.
- Show the `raw` pack.
- Show the `structured` pack.
- Explain that the outer model is the same and the evidence pack is better.
- If provider access is healthy, run strict compare.
- End on the markdown report as a fallback artifact.
The preferred product path is now the CLI:
```
traceforge doctor
traceforge analyze-batch --batch sample-starter
traceforge run --batch sample-starter --run premature_completion
traceforge pack --batch sample-starter --run premature_completion --mode structured
```

After a fresh clone:

```
./scripts/bootstrap
source .venv/bin/activate
```

Core CLI examples:
```
traceforge doctor
traceforge analyze-batch --batch sample-starter
traceforge overview --batch sample-starter
traceforge run --batch sample-starter --run premature_completion
traceforge cluster --cluster sample-starter:premature_completion:0
traceforge pack --batch sample-starter --run premature_completion --mode raw
traceforge pack --batch sample-starter --run premature_completion --mode structured
traceforge compare --batch sample-starter --run premature_completion --strict-provider
traceforge export-report --batch sample-starter
traceforge export-eval --batch sample-starter --kind blind
```

Provider-backed examples:
```
traceforge auth use codex --model gpt-5.4
traceforge compare --batch sample-starter --run premature_completion --strict-provider

traceforge auth use openai --model gpt-5.4 --openai-api-key "$OPENAI_API_KEY"
traceforge compare --batch sample-starter --run invalid_patch --provider openai --strict-provider

traceforge auth use anthropic --model claude-sonnet-4-20250514 --anthropic-api-key "$ANTHROPIC_API_KEY"
traceforge compare --batch sample-starter --run invalid_patch --provider anthropic --strict-provider
```

Explicit provider runs now namespace their eval artifacts so deterministic, API, and Codex outputs can coexist. For example, a Codex run writes files such as `sample-starter_codex_gpt_5_4_comparison.json` instead of overwriting `sample-starter_comparison.json`.
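The artifact namespacing can be sketched as a small helper. The function name and exact slug rules are assumptions for illustration; only the example filenames come from this README:

```python
import re

def eval_artifact_name(batch, provider=None, model=None, kind="comparison"):
    """Build a namespaced eval artifact filename (illustrative sketch).
    Provider and model are slugged, e.g. 'gpt-5.4' becomes 'gpt_5_4'."""
    parts = [batch]
    if provider and model:
        parts.append(provider)
        parts.append(re.sub(r"[^a-z0-9]+", "_", model.lower()))
    return "_".join(parts) + f"_{kind}.json"
```

Deterministic runs omit the provider/model segment, which is how the provider-specific and default artifacts avoid clobbering each other.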
Developer-only Jac entrypoints still exist underneath the wrapper, but they are now the internal API layer rather than the recommended operator interface.
After exporting the gold worksheet and filling it in, the supported operator path for rigorous scoring is the CLI:
```
traceforge export-eval \
  --batch sample-starter \
  --kind rigorous \
  --provider openai \
  --annotation-path exports/evals/sample-starter_gold_template.json
```

The lower-level scoring helper still lives in eval.jac, but it is not exposed as a public CLI subcommand beyond `export-eval --kind rigorous`.
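Rigorous scoring reports gold-score uplift summaries. The underlying arithmetic amounts to comparing scores for raw versus structured analyses against the gold labels; a minimal sketch (hypothetical names, not the eval.jac implementation):

```python
def gold_score_uplift(raw_scores, structured_scores):
    """Mean structured score minus mean raw score over the gold subset.
    Positive uplift means the structured evidence pack scored higher
    against the manually labeled gold annotations (illustrative only)."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(structured_scores) - mean(raw_scores)
```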
Local upload batches are discovered from folders under `uploads/` that contain `*.traj.json` files, or from zip archives that get extracted into a top-level upload batch directory. The repo includes `local_demo_batch` as a fixture for the folder path, and the smoke suite generates a zip fixture at runtime for the archive path.

If you point `UploadBatch` at an external folder outside `uploads/`, TraceForge now creates a managed alias under `uploads/` so later `ParseBatch`, `AnalyzeBatch`, and `GetRunView` calls work through a stable upload batch ID.
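The discovery rule can be sketched as follows. This is an assumed Python reimplementation for illustration; the actual logic lives in the Jac ingest code:

```python
import zipfile
from pathlib import Path

def discover_upload_batches(uploads_dir):
    """Find upload batch folders containing *.traj.json files,
    extracting zip archives into sibling batch directories first."""
    uploads = Path(uploads_dir)
    for archive in uploads.glob("*.zip"):
        target = uploads / archive.stem
        if not target.exists():           # extract once, then reuse
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(target)
    return sorted(
        folder.name
        for folder in uploads.iterdir()
        if folder.is_dir() and any(folder.rglob("*.traj.json"))
    )
```

Folders without any `*.traj.json` are skipped, which keeps unrelated directories under `uploads/` from showing up in the batch catalog.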
Expected project settings are in jac.toml.
The UI is intentionally secondary to the CLI product. Use it only when you explicitly want a visual appendix or a backup demo surface.
Start it with:

```
jac start --dev
```

- Long-term architecture brief: LONG_TERM_PLAN.md
- Short-term execution plan: SHORT_TERM_PLAN.md
- Jac-only completion plan: JAC_ONLY_COMPLETION_PLAN.md
The JacHacks site and participant guide emphasize:
- meaningful Jac integration,
- a working demo,
- technical depth,
- real-world impact,
- and a clear 3-minute presentation.
Relevant docs are kept under docs/submission.
Recommended judge path:
- Start in the terminal, not the UI.
- Show `doctor`.
- Show one failed run.
- Show `raw` versus `structured` evidence packs.
- If available, run strict provider compare.
- End on exported markdown artifacts as backup.