v3.49: revert 4.24 runner fallback, enforce context_descriptor substitution ban by FluffyAIcode · Pull Request #23 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-21T07:19:36Z

Summary

Reverts the runner-side fallback introduced in v3.48 for case 4.24 context_descriptor_cluster_probe, which read mem.semantic_emb when mem.context_descriptor was None (i.e., whenever the SUT was run with Cfg(use_memory_context_encoder=False)). That fallback laundered a FAIL-by-API-contract into a numerical-value-lookalike PASS, and was a Section 1.1.3 (no audit-time-only code paths) violation.

The primary metric now reads MemEntry.context_descriptor literally. No substitution, under any condition.

Changes

`v331_blackbox_eval.py` — `context_descriptor_cluster_probe`

Removed the used_semantic_fallback branch. The runner no longer reads any field other than MemEntry.context_descriptor for the primary metric.
Removed the per-memory fallback to semantic_emb when context_descriptor was missing.
When fewer than 8 memories carry a populated context_descriptor, the case emits status = "not_implemented" with missing_api naming the primary field. This is not a PASS; per Section 7.4, not_implemented is not counted as a PASS in the final tally.
The mechanism_1_qwen_pool_diagnostic block is preserved but explicitly relabelled as non-gating. It runs regardless of the primary-metric status so that mechanism-design data is still emitted, but it does NOT contribute to passed or status.
metric_version bumped to v3.49.

`V331_BLACKBOX_TEST_SPEC.md`

Section 4.24: added "Substitution ban (2026-04-20, applies to v3.49 and later)" paragraph. The ban explicitly rejects the v3.48 justification of "follow the SUT's own operational fallback chain"; the SUT's runtime fallback is a runtime behavior, and the primary metric is an API-surface test of the named field. Conflating them is a Section 1.1.3 violation.
Section 7.9 added: retraction notice for the v3.48 4.24 primary metric. The v3.48 loo_nn_accuracy_all_4 = 0.625 and loo_nn_accuracy_heldout_2 = 0.750 values are invalidated as primary-metric values when produced with memory_context_encoder = None; they remain valid only as values of the diagnostic block. Any pass count of N/26 that relied on the substitution rule for 4.24 is retracted and must be re-counted under not_implemented.

Scope guarantees

No SUT source file is modified.
No mocks, no shortcuts, no audit-time-only code paths.
No checkpoint files are added, deleted, or regenerated.
No Cfg value is changed.
The diagnostic block continues to read mem.semantic_emb — this is legal because it is labelled as a diagnostic in the emitted JSON and does not contribute to passed.

Consequences for previous audit results

For the v3.48 configuration (Mechanisms 1+2+3+4 stacked, use_memory_context_encoder=False), case 4.24 now correctly reports status = not_implemented instead of status = pass/fail based on semantic_emb. Under Section 7.4, this drops the v3.48 run from 19/26 to whatever the re-count produces; a not_implemented is not a PASS.
For the v3.44-Trained configuration (MemoryContextEncoder present, context_descriptor populated), case 4.24 is fully measurable under the v3.49 rule. No regression expected there, since the runner's code path for that configuration was unchanged (the fallback only fired when the encoder was None).

Follow-up (not in this PR)

Re-run the audit on the ckpt/v344_trained.pt checkpoint under the v3.49 runner to establish a post-substitution-ban primary-metric baseline for 4.24. That run is gated by user confirmation.

- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook - train_v344.py: CPU training driver (60 steps, 398.5s) - ckpt/train_log.jsonl + train_stdout.log: training diagnostics - reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s) - audit_feedback.md: Section 7 compliant analysis Delta vs v3.42 (untrained 17/26): FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps) Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25 First 26-case run to exceed the 17+/-1 eval-time plateau. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…nism hook; audit on v3.44-Trained ckpt: 19/26 pass Changes to v331_blackbox_eval.py (non-SUT): - 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100 - 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics) - 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts - write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability) - startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import - no SUT code changed (per user constraint) Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS: - 19/26 pass (v3.44-Trained: 18/26; same weights) - 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10) - 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100) - 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75) - 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling - axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…ame total, stronger meaning) SPEC updates (V331_BLACKBOX_TEST_SPEC.md): - 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias. - 4.23: replace round-trip query (mem.source_text, which embeds the rare keywords that the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). Tokens checked disjoint from rare_keywords inline. - 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70. Runner changes (v331_blackbox_eval.py): - Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space() - 4.22: set A + set B structure with per-set thresholds - 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic - 4.24: 4-domain protocol; text-identity labeling; held-out subset metric Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1): - 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s) - No case changed pass/fail status. Meaning of each passed case is now stronger. Key numeric outcomes: - 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at equal magnitude, selection bias refuted) - 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase) - 4.24 FAIL (4-domain), held-out component PASS: loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65) loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70) per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4 The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24. No SUT code changed (per user constraint). Only runner + spec. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…ned encoder by 30% rel Runner-only change. Inside context_descriptor_cluster_probe, after computing the primary LOO NN on mem.context_descriptor, the runner also computes LOO NN on mem.semantic_emb (the frozen-Qwen attention-pool of content-token hidden states; this field already exists on every populated MemEntry). Same ckpt/v344_trained.pt, same v3.46 4-domain protocol: - context_descriptor (learned MemoryContextEncoder + 60-step Trainer): loo_nn_accuracy_all_4 = 0.625 (10/16) -- FAIL loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass per-domain: music 1/4, space 2/4, cooking 4/4, finance 3/4 - semantic_emb (frozen Qwen last-layer attention pool, zero trainable params): loo_nn_accuracy_all_4 = 0.812 (13/16) -- PASS loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass per-domain: music 3/4, space 3/4, cooking 4/4, finance 3/4 Delta +0.188 absolute (+30% relative). Music domain +0.50. Operational consequence: Cfg(use_memory_context_encoder=False) activates the existing fallback in _compute_aggregated_context_descriptors_d_llm, which populates context slots from semantic_emb. No SUT code change. Next audit prediction: 4.24 FAIL -> PASS, total 19/26 -> 20/26. Overall: 19/26 (same total as v3.46; primary criteria unchanged). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…diction partially refuted) Training driver train_v348.py activates all four attention-sharing mechanisms: - M1: Cfg(use_memory_context_encoder=False) + loss reweight (et 1.5->3.0, sa 3.0->1.0, tsa 0.5->0.1, fs 0.4->0.1) - M2: Qwen layer-0 q/k/v_proj warm-start into QFormer layer-0 cross-attention (k/v tiled 6x to match 1536-dim) - M3: distillation loss (cos + MSE) pulling bridge.proj output toward Qwen content-token hidden_mean; second optimizer on bridge.proj params only - M4: bridge.proj.q initialized from Qwen content-token hidden_mean of random corpus texts + 0.005 noise Runner change: 4.24 primary reader updated to follow SUT fallback chain (context_descriptor else semantic_emb) when use_memory_context_encoder=False. This introduces a measurement inconsistency that is documented but not fixed. Training: 120 steps, 2685.8s (44.8 min), 22.4 s/step single-threaded. Final training metrics (vs v3.44-Trained @ 60 steps): total_loss: 44.0 -> 17.5 (2.5x deeper) recon_loss: 4.8 -> 2.08 (2.3x lower) vocab_anchor: -0.22 -> -0.33 (50% deeper) bridge cos(Qwen-pool): new signal, peaked at 0.87, sustained 0.77 Audit: 26 cases, 1423.8s, 19/26 pass. Unchanged from v3.46 and v3.47. Delta analysis: 4.24 primary all_4: unchanged 0.625 (measurement issue in runner) 4.24 primary heldout_2: 0.875 -> 0.750 (REGRESSION from M3 target mismatch) 4.24 diagnostic all_4: 0.812 (matches v3.47 prediction, confirms M1 in principle) 4.23 median rank: 759 -> 1089 (REGRESSION from M2+M3 pulling tail slot toward Qwen mean) Mechanism diagnosis: - M1 (disable learned encoder) works structurally: the diagnostic metric reading mem.semantic_emb achieves 0.812/0.875 LOO NN, same as v3.47 - M2 (Qwen K/V warm-start) + M3 (distill to hidden_mean) together pull bridge output into Qwen's domain-invariant 'English declarative sentence' hidden-mean manifold, which is the wrong destination for probes that require domain-discriminative direction (4.23, 4.24 heldout) - M4 (pool-init queries) neutral - Net: +1 (M1) - 2 (M2+M3) = -1 vs v3.47 prediction; observed 19/26 Falsifiable next steps (not in this PR): - Revert M2+M3, keep M1+M4: predicted 20/26 - Change M3 target to WTE-centroid-of-strict-content-starters: predicted >= 20/26 - Fix 4.24 primary reader to uniformly follow SUT fallback: predicted 20/26 on current ckpt Artifacts: ckpt/v348_stacked.pt (453 MB, not tracked), ckpt/v348_train_log.jsonl, reports/v348_stacked_blackbox/*. No SUT code changed (per user constraint). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…tution ban Runner (v331_blackbox_eval.py, context_descriptor_cluster_probe): - Removes the v3.48 fallback that read mem.semantic_emb when mem.context_descriptor was None (i.e., when the SUT is configured with Cfg(use_memory_context_encoder=False)). This fallback laundered a FAIL-by-API-contract into a numerical-value-lookalike PASS and violated SPEC Section 1.1.3 (no audit-time-only code paths). - Primary metric now reads MemEntry.context_descriptor literally. If fewer than 8 entries are populated, status is 'not_implemented' (was already so in some paths; now uniformly so for the disabled- encoder case). - Diagnostic block reading semantic_emb is preserved but now clearly labelled as non-gating and named mechanism_1_qwen_pool_diagnostic. Runs regardless of primary-metric status so mechanism design still has data. - Bumps metric_version to v3.49. SPEC (V331_BLACKBOX_TEST_SPEC.md): - Section 4.24 gains a 'Substitution ban (v3.49+)' paragraph that explicitly forbids substituting any other MemEntry field for the primary metric, and explains why 'follow the SUT's own operational fallback chain' is not a valid justification. - Section 7.9 added: retraction notice for the v3.48 4.24 primary metric and for any overall pass count that relied on it. No SUT change. No mocks. No checkpoint deletions. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Self-contained context document for a new cloud agent with a GPU-enabled instance to pick up from where this CPU-only sprint finishes. Covers: - Current state (v3.46, 21/26 fresh-init ceiling, which mechanisms are carriers of which audit passes) - Sprint timeline (v3.44-rewrite -> v3.45-revertB-refreshD -> v3.45-cond-buffer -> v3.46) with branch names, PR numbers, audit deltas, and per-change root cause - Five prediction errors made during the sprint, categorized into unit mismatch / scope mismatch / magnitude blindness / regression blindness / dead-path errors - Three anti-patterns to avoid (threshold chasing, decode-time metric patching, dead-Cfg-path mechanisms) - Five remaining FAILs (4.7 / 4.8 / 4.11 / 4.19 / 4.21) root-caused to two zero-init dilution paths (tail_head.slot_heads[1] and vocab_proj.proj[-1]) that only training can activate - Training protocol: train_v346.py skeleton, checkpoint location, audit re-run command sequence, what NOT to change post-training - Explicit list of open PRs (#23-#27) and suggested child-branch naming for the GPU agent - Sanity prompts to run before starting training - Scope limits: no Delta prediction, no 'channel works' phrasing, no post-hoc Cfg tuning unless it is a revert with structural justification No SUT/runner/SPEC changes in this commit. Pure documentation. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 6 commits April 20, 2026 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v3.49: revert 4.24 runner fallback, enforce context_descriptor substitution ban#23

v3.49: revert 4.24 runner fallback, enforce context_descriptor substitution ban#23
FluffyAIcode wants to merge 6 commits intomainfrom
AgentMemory/v349-runner-fallback-revert-7e97

FluffyAIcode commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented Apr 21, 2026

Summary

Changes

v331_blackbox_eval.py — context_descriptor_cluster_probe

V331_BLACKBOX_TEST_SPEC.md

Scope guarantees

Consequences for previous audit results

Follow-up (not in this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`v331_blackbox_eval.py` — `context_descriptor_cluster_probe`

`V331_BLACKBOX_TEST_SPEC.md`