feat(s12_evaluate): register binary_classify in default stage (v0.25.0) #31

Merged: CocoRoF merged 1 commit into main from feat/register-binary-classify on Apr 20, 2026

CocoRoF commented on Apr 20, 2026

Summary

Two additive changes:

  1. binary_classify is now registered in the default Stage 12 (EvaluateStage) strategy slot alongside signal_based / criteria_based / agent_evaluation. A manifest with strategies={"strategy": "binary_classify"} now resolves to a real BinaryClassifyEvaluation instance instead of silently falling back to SignalBasedEvaluation.
  2. BinaryClassifyEvaluation.configure(config: dict) is given a working body, so strategy_configs={"strategy": {"easy_max_turns": N, "not_easy_max_turns": M}} flows through at manifest-restore time.
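
The two changes can be illustrated with a toy registry. This is a minimal sketch, not the geny_executor implementation: the `resolve` helper, the `before`/`after` registry dicts, and the placeholder turn limits are assumptions made for illustration. Only the strategy names, the `configure(config: dict)` signature, and the two config keys come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


class SignalBasedEvaluation:
    """Toy stand-in for the fallback strategy (not the real class)."""
    name = "signal_based"


@dataclass
class BinaryClassifyEvaluation:
    """Toy stand-in; the defaults below are placeholders, not the real ones."""
    easy_max_turns: int = 1
    not_easy_max_turns: int = 5
    name = "binary_classify"  # un-annotated, so a class attribute, not a field

    def configure(self, config: dict) -> None:
        # Copy only the keys this strategy understands; ignore everything else.
        for key in ("easy_max_turns", "not_easy_max_turns"):
            if key in config:
                setattr(self, key, int(config[key]))


def resolve(registry: dict, strategy: str, config: Optional[dict] = None):
    """Hypothetical resolver: unregistered names fall back to signal_based."""
    cls = registry.get(strategy, SignalBasedEvaluation)
    instance = cls()
    if config and hasattr(instance, "configure"):
        instance.configure(config)
    return instance


# Registry without and with the new entry.
before = {"signal_based": SignalBasedEvaluation}
after = {**before, "binary_classify": BinaryClassifyEvaluation}

degraded = resolve(before, "binary_classify")
restored = resolve(
    after, "binary_classify", {"easy_max_turns": 2, "not_easy_max_turns": 8}
)
```

In this sketch `degraded` is a `SignalBasedEvaluation` (the silent fallback described above), while `restored` is a `BinaryClassifyEvaluation` carrying the configured turn limits.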

Why

Before v0.25.0, binary_classify lived only in the adaptive artifact module and was reachable only via the builder's .with_evaluate(strategy=BinaryClassifyEvaluation(...)) kwarg. That meant serializing the worker_adaptive preset through an EnvironmentManifest silently degraded it to signal-based evaluation — its adaptive identity was lost the moment it passed through a manifest round-trip.

Geny's manifest-first cutover (dev_docs/20260420_3/plan/02_default_env_per_role.md) requires build_default_manifest.stages to emit the full preset layout declaratively. This PR makes that faithful for worker_adaptive.

Backwards compatibility

No breaking changes. The adaptive artifact remains strategy-only; its Python import path (from geny_executor.stages.s12_evaluate.artifact.adaptive.strategy import BinaryClassifyEvaluation) is unchanged. The default registry keeps every pre-existing strategy — this PR is purely additive inside EvaluateStage.__init__.

Test plan

  • tests/unit/test_binary_classify_manifest.py — 6 new tests
  • Full suite: 1029 passed, 18 skipped (up from 1023)
  • ruff check + ruff format --check clean
  • CHANGELOG.md [0.25.0] entry
  • Version bumped to 0.25.0 in both pyproject.toml and __init__.py

🤖 Generated with Claude Code

Adds 'binary_classify' to the default EvaluateStage's strategy slot
registry so manifests with
  strategies={"strategy": "binary_classify"}
resolve to a real BinaryClassifyEvaluation instance instead of
silently falling back to SignalBasedEvaluation.

Also gives BinaryClassifyEvaluation.configure(...) a working body
so strategy_configs flows through — easy_max_turns and
not_easy_max_turns land on the strategy's internal config at
manifest-restore time.

This unblocks Geny's manifest-first cutover (PR10 of the
20260420_3 cycle) where build_default_manifest.stages emits a
StageManifestEntry for worker_adaptive's Stage 12 with
strategies.strategy = "binary_classify". Without this change,
serializing worker_adaptive through an EnvironmentManifest silently
degraded it to signal-based evaluation.

No breaking changes. The adaptive artifact remains strategy-only
and its Python import path is unchanged. The default registry keeps
all three pre-existing strategies; binary_classify is additive.

Tests: tests/unit/test_binary_classify_manifest.py (6 new tests).
Full suite: 1029 passed, 18 skipped. Ruff + format clean.

Refs: Geny/dev_docs/20260420_3/plan/02_default_env_per_role.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CocoRoF merged commit ece7b15 into main on Apr 20, 2026
6 checks passed
CocoRoF deleted the feat/register-binary-classify branch on April 20, 2026 07:24
CocoRoF added a commit to CocoRoF/Geny that referenced this pull request Apr 20, 2026
Fills in build_default_manifest.stages (previously an empty list with
a "filled in by a later PR" comment) with StageManifestEntry objects
that mirror the worker_adaptive and vtuber GenyPresets stage chains.
Also bumps the executor pin to >=0.25.0,<0.26.0 so manifest-restore
can resolve the binary_classify evaluator strategy that
worker_adaptive depends on.
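
As a rough illustration of what the >=0.25.0,<0.26.0 pin admits, here is a sketch using plain tuple comparison. The helper names are hypothetical, and it only handles simple X.Y.Z release strings; real dependency resolvers apply the full PEP 440 rules.

```python
def parse(version: str) -> tuple:
    """Turn 'X.Y.Z' into a tuple of ints (plain release versions only)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version: str, lower: str = "0.25.0", upper: str = "0.26.0") -> bool:
    """True when lower <= version < upper, mirroring >=0.25.0,<0.26.0."""
    return parse(lower) <= parse(version) < parse(upper)


# Versions on either side of the pin boundaries.
checks = {v: satisfies_pin(v) for v in ("0.24.9", "0.25.0", "0.25.3", "0.26.0")}
```

Under these assumptions only 0.25.0 and 0.25.3 pass. The lower bound excludes executor releases that lack the binary_classify registration.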

### What the manifest carries

- Full active stage list per preset (worker_adaptive: 1–9 + 12, 13,
  15, 16; vtuber: same minus Stage 8/Think).
- Per-stage artifact name ("default" throughout) and strategy slot
  selections that exactly match the preset builder's output (e.g.
  cache strategy "aggressive_cache" for worker vs "system_cache" for
  vtuber, evaluator "binary_classify" vs "signal_based").
- Static configs: loop.max_turns (30 for worker, 10 for vtuber),
  binary_classify.{easy_max_turns, not_easy_max_turns}.
- tools.built_in (Read/Write/Edit/Bash/Glob/Grep, shared constant),
  tools.external (plumbed through from the caller's whitelist).
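
The list above can be summarized as plain data. The dict layout and the `sketch_manifest` helper below are illustrative assumptions, not the real EnvironmentManifest or StageManifestEntry schema; the stage numbers, strategy names, and max_turns values are taken from this commit message. The binary_classify turn limits are left out because their values are not stated here.

```python
# Active stages per preset, as listed above.
WORKER_STAGES = [*range(1, 10), 12, 13, 15, 16]
VTUBER_STAGES = [s for s in WORKER_STAGES if s != 8]  # same minus Stage 8 (Think)

# Shared built-in tool constant.
BUILT_IN_TOOLS = ("Read", "Write", "Edit", "Bash", "Glob", "Grep")


def sketch_manifest(preset: str, external_tools: list) -> dict:
    """Hypothetical flat summary of what each preset's manifest carries."""
    is_worker = preset == "worker_adaptive"
    return {
        "stages": WORKER_STAGES if is_worker else VTUBER_STAGES,
        "artifact": "default",
        "cache_strategy": "aggressive_cache" if is_worker else "system_cache",
        "evaluate_strategy": "binary_classify" if is_worker else "signal_based",
        "loop_max_turns": 30 if is_worker else 10,
        "tools": {
            "built_in": list(BUILT_IN_TOOLS),
            "external": list(external_tools),  # caller's whitelist
        },
    }


# "search" is a made-up external tool name used only for this example.
worker = sketch_manifest("worker_adaptive", ["search"])
vtuber = sketch_manifest("vtuber", [])
```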

### What the manifest does NOT carry (and why)

Three slots are declared with *default* strategies and are meant to
be overwritten by Pipeline.attach_runtime(...) at session start:

  - context.retriever   → GenyMemoryRetriever   (needs per-session memory_manager)
  - memory.strategy     → GenyMemoryStrategy    (needs memory_manager + llm_reflect + curated_km)
  - memory.persistence  → GenyPersistence       (needs memory_manager)

This matches the declarative-only principle called out in
dev_docs/20260420_3/plan/02_default_env_per_role.md: the manifest
expresses stage shape and static params, runtime-scoped Python
objects attach post-construction.
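
A toy model of the declare-then-attach pattern described above. `ToyPipeline` and the keyword-argument form of `attach_runtime` are assumptions made for illustration; the real Pipeline.attach_runtime signature is not shown in this commit message. Only the three slot names and the Geny class names come from the text.

```python
class ToyPipeline:
    """Toy model of a pipeline whose slots hold strategy objects."""

    RUNTIME_SLOTS = ("context.retriever", "memory.strategy", "memory.persistence")

    def __init__(self):
        # Manifest restore fills every slot with a declarative default.
        self.slots = {slot: "default" for slot in self.RUNTIME_SLOTS}

    def attach_runtime(self, **overrides):
        """Hypothetical signature: keyword names use '_' in place of '.'."""
        for key, obj in overrides.items():
            slot = key.replace("_", ".", 1)
            if slot not in self.slots:
                raise KeyError(f"unknown runtime slot: {slot}")
            self.slots[slot] = obj
        return self


memory_manager = object()  # placeholder for the per-session object

# Session start: swap in runtime-scoped objects after construction.
pipeline = ToyPipeline().attach_runtime(
    context_retriever=("GenyMemoryRetriever", memory_manager),
    memory_persistence=("GenyPersistence", memory_manager),
)
```

Slots that are not overridden (here memory.strategy) keep their declarative default, which is why the manifest can stay free of per-session objects.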

Stage 10 (tool) is not in the declarative list — the preset
registers it conditionally on `tools=` being passed, and the tool
registry is built from tools.external + adhoc_providers at
session-build time via Pipeline.from_manifest_async.

Stage 3 (system) declares builder="composable" to match the preset,
but the ComposablePromptBuilder's block list (PersonaBlock +
DateTimeBlock + MemoryContextBlock) is runtime state and isn't
encoded here. A later PR will expand attach_runtime to accept a
system builder; until then, session-build code will either set the
blocks itself or continue to fall back to the preset path.

### Parity verification

Ad-hoc smoke test (/tmp/test_manifest_parity.py) compares the
manifest-built pipeline to GenyPresets.{worker_adaptive, vtuber}
on stage orders, per-stage artifact, per-slot strategy name (minus
the three runtime-swapped slots), loop.max_turns, and the
binary_classify strategy's live config. All 17 assertions pass.
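
The kind of comparison described above might look like the following sketch. It is not the contents of /tmp/test_manifest_parity.py: the dict shape, the `parity_diff` helper, and the "evaluate.strategy" slot key are hypothetical, and the sample data is shortened for illustration.

```python
# Slots that are expected to differ because they are swapped at runtime.
RUNTIME_SWAPPED = {"context.retriever", "memory.strategy", "memory.persistence"}


def parity_diff(preset: dict, manifest: dict) -> list:
    """Return human-readable mismatches; an empty list means parity."""
    problems = []
    if preset["stage_order"] != manifest["stage_order"]:
        problems.append("stage order differs")
    if preset["max_turns"] != manifest["max_turns"]:
        problems.append("loop.max_turns differs")
    all_slots = set(preset["strategies"]) | set(manifest["strategies"])
    for slot in sorted(all_slots):
        if slot in RUNTIME_SWAPPED:
            continue  # excluded from the comparison
        if preset["strategies"].get(slot) != manifest["strategies"].get(slot):
            problems.append(f"strategy differs in slot {slot}")
    return problems


preset = {
    "stage_order": [1, 2, 3],
    "max_turns": 30,
    "strategies": {
        "evaluate.strategy": "binary_classify",
        "memory.strategy": "GenyMemoryStrategy",
    },
}
manifest_ok = {
    **preset,
    "strategies": {
        "evaluate.strategy": "binary_classify",
        "memory.strategy": "default",
    },
}
manifest_bad = {
    **preset,
    "strategies": {
        "evaluate.strategy": "signal_based",
        "memory.strategy": "default",
    },
}

ok_problems = parity_diff(preset, manifest_ok)
bad_problems = parity_diff(preset, manifest_bad)
```

In this sketch `manifest_bad` models the pre-v0.25.0 behavior, where the evaluator silently restored as signal_based.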

### Why bump to v0.25.0

v0.25.0 registers binary_classify in the default EvaluateStage's
strategy slot registry. Without it, worker_adaptive's manifest
silently restored to signal_based evaluation. See
CocoRoF/geny-executor#31 for the executor-side change.

Refs: dev_docs/20260420_3/plan/02_default_env_per_role.md (PR 3)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>