feat(s12_evaluate): register binary_classify in default stage (v0.25.0)#31
Merged
Adds 'binary_classify' to the default EvaluateStage's strategy slot
registry so manifests with
strategies={"strategy": "binary_classify"}
resolve to a real BinaryClassifyEvaluation instance instead of
silently falling back to SignalBasedEvaluation.
Also gives BinaryClassifyEvaluation.configure(...) a working body
so strategy_configs flows through — easy_max_turns and
not_easy_max_turns land on the strategy's internal config at
manifest-restore time.
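
The two changes above can be pictured with a toy registry. This is only a sketch under stated assumptions: the classes are stand-ins that share names with the real ones, while `resolve`, `OLD_REGISTRY`, `NEW_REGISTRY`, and the default turn limits are hypothetical and are not taken from geny-executor. The old registry also omits the other two pre-existing strategies for brevity.

```python
from dataclasses import dataclass


class SignalBasedEvaluation:
    """Toy stand-in for the fallback strategy (not the real class)."""

    def configure(self, config: dict) -> None:
        # Nothing to configure in this sketch.
        pass


@dataclass
class BinaryClassifyEvaluation:
    """Toy stand-in; the default turn limits are made-up placeholders."""

    easy_max_turns: int = 1
    not_easy_max_turns: int = 30

    def configure(self, config: dict) -> None:
        # Copy only the keys this strategy understands; ignore the rest.
        for key in ("easy_max_turns", "not_easy_max_turns"):
            if key in config:
                setattr(self, key, int(config[key]))


def resolve(registry: dict, strategies: dict, strategy_configs: dict):
    """Look up the slot's strategy by name, falling back silently if unknown."""
    name = strategies.get("strategy", "signal_based")
    factory = registry.get(name, SignalBasedEvaluation)
    strategy = factory()
    strategy.configure(strategy_configs.get("strategy", {}))
    return strategy


# Hypothetical registries: before and after binary_classify is registered.
OLD_REGISTRY = {"signal_based": SignalBasedEvaluation}
NEW_REGISTRY = {**OLD_REGISTRY, "binary_classify": BinaryClassifyEvaluation}

manifest_strategies = {"strategy": "binary_classify"}
manifest_configs = {"strategy": {"easy_max_turns": 2, "not_easy_max_turns": 20}}

before = resolve(OLD_REGISTRY, manifest_strategies, manifest_configs)
after = resolve(NEW_REGISTRY, manifest_strategies, manifest_configs)
print(type(before).__name__)  # SignalBasedEvaluation
print(type(after).__name__, after.easy_max_turns, after.not_easy_max_turns)
```

With the old registry the unknown name degrades to the fallback without any error, which is the failure mode this commit removes.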
This unblocks Geny's manifest-first cutover (PR10 of the
20260420_3 cycle) where build_default_manifest.stages emits a
StageManifestEntry for worker_adaptive's Stage 12 with
strategies.strategy = "binary_classify". Without this change,
serializing worker_adaptive through an EnvironmentManifest silently
degraded it to signal-based evaluation.
No breaking changes. The adaptive artifact remains strategy-only
and its Python import path is unchanged. The default registry keeps
all three pre-existing strategies; binary_classify is additive.
Tests: tests/unit/test_binary_classify_manifest.py (6 new tests).
Full suite: 1029 passed, 18 skipped. Ruff + format clean.
Refs: Geny/dev_docs/20260420_3/plan/02_default_env_per_role.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CocoRoF added a commit to CocoRoF/Geny that referenced this pull request on Apr 20, 2026
Fills in build_default_manifest.stages (previously an empty list with
a "filled in by a later PR" comment) with StageManifestEntry objects
that mirror the worker_adaptive and vtuber GenyPresets stage chains.
Also bumps the executor pin to >=0.25.0,<0.26.0 so manifest-restore
can resolve the binary_classify evaluator strategy that
worker_adaptive depends on.
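
As a rough illustration of what the `>=0.25.0,<0.26.0` pin admits, the check below uses plain tuple comparison. It is a simplified sketch, not a PEP 440 implementation: `in_executor_pin` is a hypothetical helper and it only handles purely numeric versions (no pre-release or local tags).

```python
def in_executor_pin(version: str) -> bool:
    """Return True if a numeric version falls inside >=0.25.0,<0.26.0."""
    parts = tuple(int(p) for p in version.split("."))
    # Pad to three components so "0.25" compares like "0.25.0".
    major_minor_patch = (parts + (0, 0, 0))[:3]
    return (0, 25, 0) <= major_minor_patch < (0, 26, 0)


for candidate in ("0.24.9", "0.25.0", "0.25.3", "0.26.0"):
    print(candidate, in_executor_pin(candidate))
```

Patch releases in the 0.25 line satisfy the pin, while 0.24.x (which lacks the registration) and 0.26.0 do not.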
### What the manifest carries
- Full active stage list per preset (worker_adaptive: 1–9 + 12, 13,
15, 16; vtuber: same minus Stage 8/Think).
- Per-stage artifact name ("default" throughout) and strategy slot
selections that exactly match the preset builder's output (e.g.
cache strategy "aggressive_cache" for worker vs "system_cache" for
vtuber, evaluator "binary_classify" vs "signal_based").
- Static configs: loop.max_turns (30 for worker, 10 for vtuber),
binary_classify.{easy_max_turns, not_easy_max_turns}.
- tools.built_in (Read/Write/Edit/Bash/Glob/Grep, shared constant),
tools.external (plumbed through from the caller's whitelist).
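
To make the list above concrete, here is a minimal sketch of how the Stage 12 entry could look for each preset. `StageEntry` and `evaluate_entry` are hypothetical stand-ins, not the real `StageManifestEntry`; the `order` and `name` fields and the turn-limit values are assumptions for illustration, while the `strategies` and `strategy_configs` keys follow the shapes quoted in this PR.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StageEntry:
    """Hypothetical stand-in for StageManifestEntry (field names assumed)."""

    order: int
    name: str
    artifact: str = "default"
    strategies: dict = field(default_factory=dict)
    strategy_configs: dict = field(default_factory=dict)


def evaluate_entry(preset: str) -> StageEntry:
    """Build the Stage 12 entry for a preset; turn limits are placeholders."""
    if preset == "worker_adaptive":
        return StageEntry(
            order=12,
            name="evaluate",
            strategies={"strategy": "binary_classify"},
            strategy_configs={
                "strategy": {"easy_max_turns": 1, "not_easy_max_turns": 30}
            },
        )
    return StageEntry(order=12, name="evaluate", strategies={"strategy": "signal_based"})


# Everything in the entry is plain data, so it survives a JSON round-trip.
worker = evaluate_entry("worker_adaptive")
restored = StageEntry(**json.loads(json.dumps(asdict(worker))))
print(restored == worker)  # True
print(evaluate_entry("vtuber").strategies)
```

The point of the sketch is that only names and static parameters are stored, so the serialized form carries enough to pick the same strategy again on restore.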
### What the manifest does NOT carry (and why)
Three slots are declared with *default* strategies and are meant to
be overwritten by Pipeline.attach_runtime(...) at session start:
- context.retriever → GenyMemoryRetriever (needs per-session memory_manager)
- memory.strategy → GenyMemoryStrategy (needs memory_manager + llm_reflect + curated_km)
- memory.persistence → GenyPersistence (needs memory_manager)
This matches the declarative-only principle called out in
dev_docs/20260420_3/plan/02_default_env_per_role.md: the manifest
expresses stage shape and static params, while runtime-scoped Python
objects attach post-construction.
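
The split between declarative defaults and runtime objects might look roughly like this. It is a sketch only: `ToyPipeline`, its `slots` dict, and the dict-based `attach_runtime` signature are assumptions, and the real `Pipeline.attach_runtime(...)` signature is not shown in this PR. The `Geny*` classes are empty placeholders for the per-session objects.

```python
class GenyMemoryRetriever:
    """Placeholder for a per-session object that needs a memory manager."""

    def __init__(self, memory_manager):
        self.memory_manager = memory_manager


class ToyPipeline:
    """Toy pipeline that tracks which strategy fills each slot."""

    def __init__(self, slots: dict):
        self.slots = dict(slots)

    def attach_runtime(self, overrides: dict) -> None:
        # Only slots the manifest already declared may be overwritten.
        unknown = set(overrides) - set(self.slots)
        if unknown:
            raise KeyError(f"undeclared slots: {sorted(unknown)}")
        self.slots.update(overrides)


# Manifest-restored state: the three runtime slots hold "default" strategies.
pipeline = ToyPipeline(
    {
        "context.retriever": "default",
        "memory.strategy": "default",
        "memory.persistence": "default",
        "evaluate.strategy": "binary_classify",
    }
)

# At session start, a live object replaces one of the declared defaults.
session_memory = object()  # stand-in for a per-session memory_manager
pipeline.attach_runtime({"context.retriever": GenyMemoryRetriever(session_memory)})
print(type(pipeline.slots["context.retriever"]).__name__)  # GenyMemoryRetriever
print(pipeline.slots["evaluate.strategy"])  # binary_classify
```

Rejecting undeclared slots is a design choice of this sketch; it keeps the manifest as the single source of truth for pipeline shape.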
Stage 10 (tool) is not in the declarative list — the preset
registers it conditionally on `tools=` being passed, and the tool
registry is built from tools.external + adhoc_providers at
session-build time via Pipeline.from_manifest_async.
Stage 3 (system) declares builder="composable" to match the preset,
but the ComposablePromptBuilder's block list (PersonaBlock +
DateTimeBlock + MemoryContextBlock) is runtime state and isn't
encoded here. A later PR will expand attach_runtime to accept a
system builder; until then, session-build code will either set the
blocks itself or continue to fall back to the preset path.
### Parity verification
Ad-hoc smoke test (/tmp/test_manifest_parity.py) compares the
manifest-built pipeline to GenyPresets.{worker_adaptive, vtuber}
on stage orders, per-stage artifact, per-slot strategy name (minus
the three runtime-swapped slots), loop.max_turns, and the
binary_classify strategy's live config. All 17 assertions pass.
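
The smoke test itself is not included in this commit. A comparison of the kind described could be sketched as below, assuming each pipeline has been summarized into a plain dict keyed by stage order; `parity_diffs`, `RUNTIME_SWAPPED`, and the summary shape are hypothetical and not taken from the actual script.

```python
# Slots expected to differ because attach_runtime overwrites them later.
RUNTIME_SWAPPED = {
    ("context", "retriever"),
    ("memory", "strategy"),
    ("memory", "persistence"),
}


def parity_diffs(preset: dict, manifest: dict) -> list:
    """Return human-readable differences between two pipeline summaries."""
    diffs = []
    if sorted(preset) != sorted(manifest):
        diffs.append(f"stage order: {sorted(preset)} != {sorted(manifest)}")
    for order in sorted(set(preset) & set(manifest)):
        p, m = preset[order], manifest[order]
        if p["artifact"] != m["artifact"]:
            diffs.append(f"stage {order}: artifact {p['artifact']} != {m['artifact']}")
        for slot in sorted(set(p["strategies"]) | set(m["strategies"])):
            if (p["name"], slot) in RUNTIME_SWAPPED:
                continue  # excluded from parity by design
            if p["strategies"].get(slot) != m["strategies"].get(slot):
                diffs.append(f"stage {order}: slot {slot} differs")
    return diffs


# Made-up summaries; the stage orders and retriever name are illustrative.
preset_summary = {
    2: {"name": "context", "artifact": "default", "strategies": {"retriever": "geny_memory"}},
    12: {"name": "evaluate", "artifact": "default", "strategies": {"strategy": "binary_classify"}},
}
manifest_summary = {
    2: {"name": "context", "artifact": "default", "strategies": {"retriever": "default"}},
    12: {"name": "evaluate", "artifact": "default", "strategies": {"strategy": "binary_classify"}},
}
print(parity_diffs(preset_summary, manifest_summary))  # []
```

Returning a list of differences rather than asserting inline makes a silent degradation, such as the evaluator falling back to `signal_based`, show up as a named mismatch.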
### Why bump to v0.25.0
v0.25.0 registers binary_classify in the default EvaluateStage's
strategy slot registry. Without it, worker_adaptive's manifest
silently restored to signal_based evaluation. See
CocoRoF/geny-executor#31 for the executor-side change.
Refs: dev_docs/20260420_3/plan/02_default_env_per_role.md (PR 3)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
### Summary

Two additive changes:

- `binary_classify` is now registered in the default Stage 12 (`EvaluateStage`) strategy slot alongside `signal_based` / `criteria_based` / `agent_evaluation`. A manifest with `strategies={"strategy": "binary_classify"}` now resolves to a real `BinaryClassifyEvaluation` instance instead of silently falling back to `SignalBasedEvaluation`.
- `BinaryClassifyEvaluation.configure(config: dict)` is given a working body, so `strategy_configs={"strategy": {"easy_max_turns": N, "not_easy_max_turns": M}}` flows through at manifest-restore time.

### Why

Before v0.25.0, `binary_classify` lived only in the `adaptive` artifact module and was reachable only via the builder's `.with_evaluate(strategy=BinaryClassifyEvaluation(...))` kwarg. That meant serializing the `worker_adaptive` preset through an `EnvironmentManifest` silently degraded it to signal-based evaluation: its adaptive identity was lost the moment it passed through a manifest round-trip.

Geny's manifest-first cutover (`dev_docs/20260420_3/plan/02_default_env_per_role.md`) requires `build_default_manifest.stages` to emit the full preset layout declaratively. This PR makes that faithful for `worker_adaptive`.

### Backwards compatibility

No breaking changes. The `adaptive` artifact remains strategy-only; its Python import path (`from geny_executor.stages.s12_evaluate.artifact.adaptive.strategy import BinaryClassifyEvaluation`) is unchanged. The default registry keeps every pre-existing strategy; this PR is purely additive inside `EvaluateStage.__init__`.

### Test plan

- `tests/unit/test_binary_classify_manifest.py`: 6 new tests
- `ruff check` + `ruff format --check` clean
- `[0.25.0]` entry
- `pyproject.toml` and `__init__.py`

🤖 Generated with Claude Code