feat(actor): expose per-step GSP prediction loss via last_gsp_loss #22
jdbloom merged 2 commits into feat/rddpg-lstm-fix
Conversation
The GSP prediction network's training loss was never surfaced through Actor.learn(). Only the actor/critic loss was returned, and that loss stays normal even when the GSP head collapses to a near-constant output. This change adds a last_gsp_loss attribute, populated by learn_gsp() whenever a GSP learning step fires and reset to None at the start of each learn() call, so callers can distinguish "no GSP step this tick" from "GSP step ran". Needed for the information-collapse diagnostic (see Stelaris docs/specs/2026-04-12-dispatcher-diagnostic-batch.md): without it we cannot tell whether non-recurrent GSP variants are learning or degenerate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
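The reset/populate lifecycle described above can be sketched as a toy class. This is a minimal illustration of the contract, not the repository's actual Actor (which has networks, optimizers, and replay buffers); the loss value here is a placeholder:

```python
class Actor:
    """Toy sketch of the last_gsp_loss contract only."""

    def __init__(self):
        # None means "no GSP learning step fired during the last learn() call"
        self.last_gsp_loss = None

    def learn(self, gsp_step_due: bool) -> None:
        # Reset first, so a stale loss from a previous tick can never be
        # mistaken for a fresh GSP learning step.
        self.last_gsp_loss = None
        if gsp_step_due:
            self.learn_gsp()

    def learn_gsp(self) -> None:
        loss = 0.456  # placeholder for the GSP learner's training loss
        self.last_gsp_loss = float(loss)
```

With this shape, a caller checks `actor.last_gsp_loss is None` after `learn()` to decide whether a GSP value was produced this tick.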
Verdict: Approve with minor concerns (non-blocking)

The loss-capture path is correct, the reset timing is right, the TD3 tuple handling is sound, and there are no attribute collisions or Python-version issues.

Concerns
1. Semantic mismatch between scheme branches
2. TD3 tuple handling
3. Pre-existing TD3 GSP signature bug

Test gaps (suggestions, not blockers)
Nits
…reset test

- learn_TD3 now accepts recurrent=False to match the DDPG/RDDPG signatures; the learn_gsp dispatch was passing 3 positional args to a 2-arg method. This is a latent bug today (GSP networks are built as DDPG/attention, not TD3), but fixing it removes the footgun before the diagnostic batch exercises TD3 variants.
- TD3's non-actor-update step returns (0, 0); previously we unwrapped that to 0.0 and logged it, producing false collapse signals every update_actor_iter - 1 ticks. Now we skip the entry entirely and leave last_gsp_loss at None, as if no GSP step ran.
- Document the semantics: last_gsp_loss is the GSP learner's training loss, which is the actor loss (policy-gradient signal) for DDPG/RDDPG/TD3 and a genuine MSE only for attention. For prediction-collapse detection, consumers should rely on gsp_squared_error and the HDF5Logger episode-level gsp_output_std / gsp_pred_target_corr attrs.
- Add a reset-between-ticks test covering the load-bearing invariant that last_gsp_loss returns to None when a learn() call runs but no GSP learning step fires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Second-pass review of e7f0a55 — verdict: ready to merge

All four fixes land cleanly; verified against the tree.

Per-concern verification

#3 TD3 signature — addressed.
#2 TD3 tuple path — addressed.
#1 Docstring — addressed.
#4 Reset test — valid probe.
New issues introduced by e7f0a55

None. No new typos, no None-deref risk, no dead branches. The three original tests are unaffected.

Recommendation

Ready to merge. All review concerns addressed correctly, no regressions introduced.
Summary
- Actor.last_gsp_loss attribute populated by learn_gsp() each time a GSP learning step fires
- Reset to None at the start of each learn() call so callers can distinguish "no GSP step this tick" from "GSP step ran"
- TD3's (0, 0) edge case is normalized to a scalar

Why
The primary loss returned from Actor.learn() is the actor/critic loss, which stays completely normal even when the GSP prediction head has collapsed to a near-constant output. Without surfacing the GSP network's own training loss, we cannot diagnose the "information collapse" hypothesis called out in the Memory-Enhanced GSP paper outline (`Revamped Reward structure for GSP to prevent information collapse`). See the companion PR in Stelaris / RL-CollectiveTransport (feature/hdf5-gsp-diagnostics).

Test plan
🤖 Generated with Claude Code