
feat(gsp-diagnostics): surface label, squared error, and GSP loss per step #11

Merged
jdbloom merged 6 commits into feat/learn-every-n-steps from feature/hdf5-gsp-diagnostics
Apr 13, 2026

Conversation

@jdbloom
Collaborator

@jdbloom jdbloom commented Apr 13, 2026

Summary

Wires the GSP information-collapse diagnostic signals from the training loop into the HDF5 logger.

Core change (`c699675` — this is the review target):

  • `calculate_gsp_reward` now returns `(reward, label, squared_errors)` — the raw per-robot (diff − prediction)² alongside the existing clipped reward
  • `Main.py` forwards `label` (broadcast as per-robot `gsp_target`), `squared_error`, and `model.last_gsp_loss` to the HDF5Logger each tick and learning step
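
The new 3-tuple contract can be sketched as follows. This is a minimal stand-in, not the real `env.py` code: the quadratic error and the `-2` clip floor are taken from this PR's description and review, while the exact internals (argument names, label formula reuse) are assumptions:

```python
import numpy as np

def calculate_gsp_reward(diff, next_heading_gsp):
    # Sketch of the new (reward, label, squared_errors) return shape.
    preds = np.asarray(next_heading_gsp, dtype=float)
    squared_errors = (diff - preds) ** 2            # raw per-robot error, unclipped
    rewards = np.clip(-squared_errors, -2.0, 0.0)   # existing clipped reward, saturates at -2
    label = np.clip(diff * 100, -1, 1)              # scalar target (formula quoted in the review)
    return rewards, label, squared_errors
```

With a prediction off by 3, the squared error keeps the magnitude (9.0) that the clipped reward (-2.0) discards.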

Ancestor commits in this PR (pre-existing work on `feat/learn-every-n-steps` not yet on origin — my commit depends on `h5_logger` existing, so they must ship together):

  • `0e462aa` feat(main): use HDF5Logger as sole data writer, remove pkl writes
  • `1b1ee60` feat(agent): accept ROBOT_ORDER param for R-GSP-N bias characterization
  • `ba76caa` feat(dispatcher): parameterize make_config + add algorithm_defaults
  • `ca7d5ca` feat(ingestion): add h5_logger.close() call and --experiment_name arg

Please focus review on commit `c699675` only. The four ancestor commits are existing in-progress work and should be reviewed separately if they haven't been.

Why

The clipped `gsp_reward` saturates at -2 and hides the magnitude of large prediction errors. The raw squared error carries the signal needed to detect degenerate GSP predictions. Combined with `Actor.last_gsp_loss` from the companion GSP-RL PR, this lets the HDF5 logger capture enough to diagnose GSP information collapse at scale.

See `docs/specs/2026-04-12-dispatcher-diagnostic-batch.md` in Stelaris for the information-collapse hypothesis these fields exist to test.
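
To illustrate the saturation problem with made-up numbers: two very different raw errors collapse to the same clipped reward, so only the unclipped squared error still distinguishes them:

```python
import numpy as np

errs = np.array([4.0, 100.0])          # two very different raw squared errors
clipped = np.clip(-errs, -2.0, 0.0)    # the reward channel: both saturate at -2
# clipped no longer distinguishes the cases; errs still carries the magnitude
```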

Companion PRs (must land together)

  • GSP-RL: `feature/hdf5-gsp-diagnostics`, commit `e1b138d` (provides `Actor.last_gsp_loss`)
  • Stelaris: HDF5Logger changes on the same branch

Test plan

  • 6 new test cases in `tests/test_env/test_gsp_reward.py` (TDD — watched them fail first)
  • All existing `calculate_gsp_reward` tests updated to 3-tuple unpacking (8 tests, still pass)
  • Full RL-CT test suite: 95/95 pass (excluding `test_nan_guards.py` — pre-existing unrelated import error)
  • `Main.py` syntax-checked after edits
  • Broadcasting `label` per robot aligns with `(timesteps × robots)` HDF5 schema

🤖 Generated with Claude Code

Joshua Bloom and others added 5 commits April 11, 2026 21:15
Adds --experiment_name argparse argument and calls h5_logger.close() at
end of training to emit FINAL sentinel to the ingestion worker (PR2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- algorithm_defaults.py: per-algorithm field merges; DQN/DDQN/DDPG/TD3
- make_config: accepts algorithm kwarg, calls merge_algorithm_defaults (spec §9)

Part of experiment dispatcher PR5 (spec §§5.5, 9, 14.6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Agent.__init__ reads ROBOT_ORDER (fixed|randomized) and SEED from config
- choose_agent_gsp shuffles per-robot iteration order with seeded RNG when
  ROBOT_ORDER=randomized, enabling controlled bias comparison (spec §14.3)
- Fixed order is the default, preserving existing behavior exactly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Instantiate HDF5Logger at training start (one episodes.h5 per experiment,
  co-located with Data/ and Models/ under recording_path)
- Replace data_writer.writerow() with h5_logger.writerow() (signatures match)
- Replace data_writer.write_to_file() with h5_logger.write_episode() — fires
  notify_episode sentinel on the ingestion FIFO
- close() at end of training fires notify_final for test auto-enqueue

Pickle-based data_logger import kept for backwards-compat reads but never
writes. Eliminates the disk-full risk from 500 episodes × 3 MB pkl files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… step

calculate_gsp_reward now returns (reward, label, squared_errors) so the raw
prediction error is available alongside the clipped reward. Main.py forwards
label (broadcast as per-robot gsp_target), squared_error, and model.last_gsp_loss
to the HDF5Logger each tick and learning step.

These are the fields needed to detect GSP information collapse: raw squared
error carries the magnitude beyond the reward's -2 saturation, and the GSP
network's training loss exposes degenerate learning that the actor/critic
loss would hide. See Stelaris
docs/specs/2026-04-12-dispatcher-diagnostic-batch.md for the hypothesis.

Requires companion changes in GSP-RL (feature/hdf5-gsp-diagnostics,
commit e1b138d) and Stelaris HDF5Logger (same branch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jdbloom
Collaborator Author

jdbloom commented Apr 13, 2026

Review of c699675 (wiring layer only)

Verdict: approve with one semantic concern and two suggestions. The env.py 3-tuple refactor is clean, the squared-error math is correct, tests are faithful tuple-unpack updates, and there are no other callers of calculate_gsp_reward to break. My main concern is the independent-learning record_gsp_loss cardinality.

Concerns

1. record_gsp_loss called N times per learn step in independent-learning mode (Main.py:519-522)
Inside `for i in range(num_robots): models[i].learn()`, record_gsp_loss fires once per robot model, so the Stelaris gsp_loss dataset (1D per learning step) receives N entries for a single tick's learn step; in shared-model mode it gets one. As a result the gsp_loss axis is not comparable between --independent_learning runs and shared runs, and its length is num_learn_steps * num_robots rather than num_learn_steps in the independent case. Either flag this as intended and document it, or aggregate (mean) before recording. Recommend adding a comment at line 519 stating the cardinality, or collapsing the N calls into a single record_gsp_loss(mean([...])) / record_gsp_loss([...]) vector write so the schema stays 1:1 with learn ticks.
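
A minimal sketch of the suggested mean aggregation, with a stub standing in for the real per-robot models (the actual learn()/record_gsp_loss wiring lives in Main.py; everything below is illustrative):

```python
import numpy as np

class StubModel:
    # Stand-in for the per-robot RL model; the real class lives in the repo.
    def __init__(self, loss):
        self._loss = loss
        self.last_gsp_loss = None
    def learn(self):
        self.last_gsp_loss = self._loss  # a GSP step may or may not have fired

models = [StubModel(0.4), StubModel(0.2), StubModel(None)]
for m in models:
    m.learn()

# One gsp_loss entry per learn tick, not per robot:
losses = [m.last_gsp_loss for m in models if m.last_gsp_loss is not None]
tick_loss = float(np.mean(losses)) if losses else None
```

The guard on the empty list avoids `np.mean([])` producing a NaN with a RuntimeWarning.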

2. getattr(model, "last_gsp_loss", None) fallback — Main.py:520, 525
The defensive default silently no-ops if the GSP-RL submodule is not on the companion branch. That is a deliberate decoupling from the companion PR, which is fine for rollout, but once both PRs land it masks a real regression (e.g. someone refactors the GSP trainer and drops the attribute — the diagnostic goes dark with no error). Suggest either (a) a one-time logger.warning on first miss, or (b) after both PRs merge, switch to direct attribute access and let it raise. Low priority.
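
Option (a) could look like the following sketch (the helper name and module-level flag are hypothetical, not from the codebase):

```python
import logging

logger = logging.getLogger(__name__)
_warned = False  # module-level one-shot flag (hypothetical)

def read_gsp_loss(model):
    # Warn once on the first miss so the diagnostic going dark is visible,
    # then stay quiet for the rest of the run.
    global _warned
    loss = getattr(model, "last_gsp_loss", None)
    if loss is None and not _warned:
        logger.warning("last_gsp_loss missing; GSP diagnostics will be empty")
        _warned = True
    return loss
```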

3. float(label) cast — Main.py:570
Safe: label = np.clip(diff*100, -1, 1) returns a 0-d ndarray / np.floating, both of which float() handles. No issue. Broadcast semantics ([float(label)] * num_robots) are semantically correct — the payload delta-theta is a single environment quantity all robots predict against, so per-robot replication is the right alignment for the (timesteps × robots) HDF5 schema.
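
The cast-and-broadcast pattern described above can be demonstrated in isolation (robot count is a made-up value for illustration):

```python
import numpy as np

num_robots = 3                        # hypothetical count for illustration
diff = 0.5                            # scalar environment quantity
label = np.clip(diff * 100, -1, 1)    # 0-d ndarray / np.floating, as noted above
row = [float(label)] * num_robots     # one column per robot in the (timesteps × robots) schema
```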

Verified non-issues

  • abs(reward)**2 at env.py:42 equals (diff - next_heading_gsp[i])**2. Correct quantity for gsp_squared_error.
  • num_robots is initialized from config at env.py:101 during startup, long before the main loop — no ep0/tick0 race.
  • calculate_gsp_reward has exactly one caller (Main.py:332); no other breakage.
  • The 8 test updates are pure rewards, label = ... -> rewards, label, _ = ... tuple-unpack changes. Spot-checked test_reward_clipped_at_minus_2, test_reward_per_robot, test_wraparound_angles — assertions unchanged.
  • The 6 new TestGSPSquaredErrorReturn tests are solid, especially test_squared_error_is_unclipped which pins the motivating property (raw error = 9.0 while reward saturates at -2).

Suggestion

Consider deleting the commented-out dead code block at env.py:39-44 in the old diff context (dot-product alternative reward formulation). It's already mostly gone in this commit — finish the cleanup.

Overall: ship it after deciding on concern #1 (document or aggregate). The diagnostic data will be correct at the source modulo that schema cardinality question.

…ependent mode

In --independent_learning, record_gsp_loss previously fired once per robot
model inside the per-robot learn loop, so the 1D gsp_loss dataset received
num_robots entries per learn tick instead of one. That made the
information-collapse diagnostic's gsp_loss axis length differ between
independent-learning mode (num_learn_steps × num_robots) and shared-model
mode (num_learn_steps), breaking cross-mode comparability.

Move to one-call-per-tick: run all per-robot learn() steps first, then
collect last_gsp_loss from each model and record the mean. Shared-model
branch is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jdbloom
Collaborator Author

jdbloom commented Apr 13, 2026

Second-pass review — commit 84c6203

Verdict: fix lands correctly. Ready to merge.

1. Independent-learning fix verified

rl_code/Main.py lines 516-530:

  • Aggregation runs after all per-robot learn() calls complete (loop on 523-524 finishes before the comprehension on 525-528). No stale-value risk.
  • [m.last_gsp_loss for m in models if getattr(m, "last_gsp_loss", None) is not None] correctly handles the partial-fire case (some robots stepped GSP, others didn't). Mean-of-nonzero-subset is the right semantic — it's a per-tick scalar diagnostic, and averaging only over models that actually computed a loss avoids dragging the mean toward zero with stale/None fills. Agreed.
  • if gsp_losses: guard prevents np.mean([]) RuntimeWarning/nan. Correct.
  • getattr(..., None) fallback matches shared-model branch pattern (line 533). Consistent.
  • float(np.mean(...)) cast is belt-and-suspenders (the HDF5 logger stores arrays as float32 anyway). Harmless.

2. No regression in shared-model branch

Lines 531-535 untouched. Primary diagnostic path intact.

3. Commit scope clean

git show --stat 84c6203: 1 file, rl_code/Main.py, +12/-4. No collateral. No new imports needed — numpy as np already imported at line 28.

Red flags checked

  • loss downstream semantics (line 581 writerow): unchanged. In independent mode loss still holds the last robot's learn() return after the loop — same behavior as before the fix. Not affected.
  • last_gsp_loss read timing: reads happen immediately after the learn loop, before any episode-end hidden-state reset (line 590+). Safe.
  • models definedness: the args.independent_learning branch is the only caller of models[...], and models is populated unconditionally in that mode upstream. No new branch risk introduced.
  • No test changes: expected — integration path needs ARGoS.

Cardinality asymmetry from first-pass review is resolved. Ship it.

@jdbloom jdbloom merged commit 57ada5f into feat/learn-every-n-steps Apr 13, 2026
@jdbloom jdbloom deleted the feature/hdf5-gsp-diagnostics branch April 13, 2026 00:32