
feat(gsp-diagnostics): surface label, squared error, and GSP loss per step #11

Merged
jdbloom merged 6 commits into feat/learn-every-n-steps from feature/hdf5-gsp-diagnostics
Apr 13, 2026

Conversation

@jdbloom
Collaborator

@jdbloom jdbloom commented Apr 13, 2026

Summary

Wires the GSP information-collapse diagnostic signals from the training loop into the HDF5 logger.

Core change (`c699675` — this is the review target):

  • `calculate_gsp_reward` now returns `(reward, label, squared_errors)` — the raw per-robot (diff − prediction)² alongside the existing clipped reward
  • `Main.py` forwards `label` (broadcast as per-robot `gsp_target`), `squared_error`, and `model.last_gsp_loss` to the HDF5Logger each tick and learning step
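
The new 3-tuple contract can be sketched as follows. This is a minimal stand-in, not the real `env.py` code: the quadratic error and the `-2` clip floor are taken from this PR's description and review, while the exact internals (argument names, label formula reuse) are assumptions:

```python
import numpy as np

def calculate_gsp_reward(diff, next_heading_gsp):
    # Sketch of the new (reward, label, squared_errors) return shape.
    preds = np.asarray(next_heading_gsp, dtype=float)
    squared_errors = (diff - preds) ** 2            # raw per-robot error, unclipped
    rewards = np.clip(-squared_errors, -2.0, 0.0)   # existing clipped reward, saturates at -2
    label = np.clip(diff * 100, -1, 1)              # scalar target (formula quoted in the review)
    return rewards, label, squared_errors
```

With a prediction off by 3, the squared error keeps the magnitude (9.0) that the clipped reward (-2.0) discards.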

Ancestor commits in this PR (pre-existing work on `feat/learn-every-n-steps` not yet on origin — my commit depends on `h5_logger` existing, so they must ship together):

  • `0e462aa` feat(main): use HDF5Logger as sole data writer, remove pkl writes
  • `1b1ee60` feat(agent): accept ROBOT_ORDER param for R-GSP-N bias characterization
  • `ba76caa` feat(dispatcher): parameterize make_config + add algorithm_defaults
  • `ca7d5ca` feat(ingestion): add h5_logger.close() call and --experiment_name arg

Please focus review on commit `c699675` only. The four ancestor commits are existing in-progress work and should be reviewed separately if they haven't been.

Why

The clipped `gsp_reward` saturates at -2 and hides the magnitude of large prediction errors. The raw squared error carries the signal needed to detect degenerate GSP predictions. Combined with `Actor.last_gsp_loss` from the companion GSP-RL PR, this lets the HDF5 logger capture enough to diagnose GSP information collapse at scale.

See `docs/specs/2026-04-12-dispatcher-diagnostic-batch.md` in Stelaris for the information-collapse hypothesis these fields exist to test.
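
To illustrate the saturation problem with made-up numbers: two very different raw errors collapse to the same clipped reward, so only the unclipped squared error still distinguishes them:

```python
import numpy as np

errs = np.array([4.0, 100.0])          # two very different raw squared errors
clipped = np.clip(-errs, -2.0, 0.0)    # the reward channel: both saturate at -2
# clipped no longer distinguishes the cases; errs still carries the magnitude
```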

Companion PRs (must land together)

  • GSP-RL: `feature/hdf5-gsp-diagnostics`, commit `e1b138d` (provides `Actor.last_gsp_loss`)
  • Stelaris: HDF5Logger changes on the same branch

Test plan

  • 6 new test cases in `tests/test_env/test_gsp_reward.py` (TDD — watched them fail first)
  • All existing `calculate_gsp_reward` tests updated to 3-tuple unpacking (8 tests, still pass)
  • Full RL-CT test suite: 95/95 pass (excluding `test_nan_guards.py` — pre-existing unrelated import error)
  • `Main.py` syntax-checked after edits
  • Broadcasting `label` per robot aligns with `(timesteps × robots)` HDF5 schema

🤖 Generated with Claude Code

Joshua Bloom and others added 5 commits April 11, 2026 21:15
Adds --experiment_name argparse argument and calls h5_logger.close() at
end of training to emit FINAL sentinel to the ingestion worker (PR2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- algorithm_defaults.py: per-algorithm field merges; DQN/DDQN/DDPG/TD3
- make_config: accepts algorithm kwarg, calls merge_algorithm_defaults (spec §9)

Part of experiment dispatcher PR5 (spec §§5.5, 9, 14.6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Agent.__init__ reads ROBOT_ORDER (fixed|randomized) and SEED from config
- choose_agent_gsp shuffles per-robot iteration order with seeded RNG when
  ROBOT_ORDER=randomized, enabling controlled bias comparison (spec §14.3)
- Fixed order is the default, preserving existing behavior exactly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Instantiate HDF5Logger at training start (one episodes.h5 per experiment,
  co-located with Data/ and Models/ under recording_path)
- Replace data_writer.writerow() with h5_logger.writerow() (signatures match)
- Replace data_writer.write_to_file() with h5_logger.write_episode() — fires
  notify_episode sentinel on the ingestion FIFO
- close() at end of training fires notify_final for test auto-enqueue

Pickle-based data_logger import kept for backwards-compat reads but never
writes. Eliminates the disk-full risk from 500 episodes × 3 MB pkl files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… step

calculate_gsp_reward now returns (reward, label, squared_errors) so the raw
prediction error is available alongside the clipped reward. Main.py forwards
label (broadcast as per-robot gsp_target), squared_error, and model.last_gsp_loss
to the HDF5Logger each tick and learning step.

These are the fields needed to detect GSP information collapse: raw squared
error carries the magnitude beyond the reward's -2 saturation, and the GSP
network's training loss exposes degenerate learning that the actor/critic
loss would hide. See Stelaris
docs/specs/2026-04-12-dispatcher-diagnostic-batch.md for the hypothesis.

Requires companion changes in GSP-RL (feature/hdf5-gsp-diagnostics,
commit e1b138d) and Stelaris HDF5Logger (same branch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jdbloom
Collaborator Author

jdbloom commented Apr 13, 2026

Review of c699675 (wiring layer only)

Verdict: approve with one semantic concern and two suggestions. The env.py 3-tuple refactor is clean, the squared-error math is correct, tests are faithful tuple-unpack updates, and there are no other callers of calculate_gsp_reward to break. My main concern is the independent-learning record_gsp_loss cardinality.

Concerns

1. record_gsp_loss called N times per learn step in independent-learning mode (Main.py:519-522)
Inside `for i in range(num_robots): models[i].learn()`, record_gsp_loss fires once per robot model, so the Stelaris gsp_loss dataset (1D per learning step) receives N entries for a single tick's learn step; in shared-model mode it gets one. As a result the gsp_loss axis is not comparable between --independent_learning runs and shared runs, and its length is num_learn_steps * num_robots rather than num_learn_steps in the independent case. Either flag this as intended and document it, or aggregate (mean) before recording. Recommend adding a comment at line 519 stating the cardinality, or collapsing the N calls into a single record_gsp_loss(mean([...])) / record_gsp_loss([...]) vector write so the schema stays 1:1 with learn ticks.
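
A minimal sketch of the suggested mean aggregation, with a stub standing in for the real per-robot models (the actual learn()/record_gsp_loss wiring lives in Main.py; everything below is illustrative):

```python
import numpy as np

class StubModel:
    # Stand-in for the per-robot RL model; the real class lives in the repo.
    def __init__(self, loss):
        self._loss = loss
        self.last_gsp_loss = None
    def learn(self):
        self.last_gsp_loss = self._loss  # a GSP step may or may not have fired

models = [StubModel(0.4), StubModel(0.2), StubModel(None)]
for m in models:
    m.learn()

# One gsp_loss entry per learn tick, not per robot:
losses = [m.last_gsp_loss for m in models if m.last_gsp_loss is not None]
tick_loss = float(np.mean(losses)) if losses else None
```

The guard on the empty list avoids `np.mean([])` producing a NaN with a RuntimeWarning.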

2. getattr(model, "last_gsp_loss", None) fallback — Main.py:520, 525
The defensive default silently no-ops if the GSP-RL submodule is not on the companion branch. That is a deliberate decoupling from the companion PR, which is fine for rollout, but once both PRs land it masks a real regression (e.g. someone refactors the GSP trainer and drops the attribute — the diagnostic goes dark with no error). Suggest either (a) a one-time logger.warning on first miss, or (b) after both PRs merge, switch to direct attribute access and let it raise. Low priority.
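
Option (a) could look like the following sketch (the helper name and module-level flag are hypothetical, not from the codebase):

```python
import logging

logger = logging.getLogger(__name__)
_warned = False  # module-level one-shot flag (hypothetical)

def read_gsp_loss(model):
    # Warn once on the first miss so the diagnostic going dark is visible,
    # then stay quiet for the rest of the run.
    global _warned
    loss = getattr(model, "last_gsp_loss", None)
    if loss is None and not _warned:
        logger.warning("last_gsp_loss missing; GSP diagnostics will be empty")
        _warned = True
    return loss
```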

3. float(label) cast — Main.py:570
Safe: label = np.clip(diff*100, -1, 1) returns a 0-d ndarray / np.floating, both of which float() handles. No issue. Broadcast semantics ([float(label)] * num_robots) are semantically correct — the payload delta-theta is a single environment quantity all robots predict against, so per-robot replication is the right alignment for the (timesteps × robots) HDF5 schema.
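
The cast-and-broadcast pattern described above can be demonstrated in isolation (robot count is a made-up value for illustration):

```python
import numpy as np

num_robots = 3                        # hypothetical count for illustration
diff = 0.5                            # scalar environment quantity
label = np.clip(diff * 100, -1, 1)    # 0-d ndarray / np.floating, as noted above
row = [float(label)] * num_robots     # one column per robot in the (timesteps × robots) schema
```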

Verified non-issues

  • abs(reward)**2 at env.py:42 equals (diff - next_heading_gsp[i])**2. Correct quantity for gsp_squared_error.
  • num_robots is initialized from config at env.py:101 during startup, long before the main loop — no ep0/tick0 race.
  • calculate_gsp_reward has exactly one caller (Main.py:332); no other breakage.
  • The 8 test updates are pure rewards, label = ... -> rewards, label, _ = ... tuple-unpack changes. Spot-checked test_reward_clipped_at_minus_2, test_reward_per_robot, test_wraparound_angles — assertions unchanged.
  • The 6 new TestGSPSquaredErrorReturn tests are solid, especially test_squared_error_is_unclipped which pins the motivating property (raw error = 9.0 while reward saturates at -2).

Suggestion

Consider deleting the commented-out dead code block at env.py:39-44 in the old diff context (dot-product alternative reward formulation). It's already mostly gone in this commit — finish the cleanup.

Overall: ship it after deciding on concern #1 (document or aggregate). The diagnostic data will be correct at the source modulo that schema cardinality question.

…ependent mode

In --independent_learning, record_gsp_loss previously fired once per robot
model inside the per-robot learn loop, so the 1D gsp_loss dataset received
num_robots entries per learn tick instead of one. That made the
information-collapse diagnostic's gsp_loss axis length differ between
independent-learning mode (num_learn_steps × num_robots) and shared-model
mode (num_learn_steps), breaking cross-mode comparability.

Move to one-call-per-tick: run all per-robot learn() steps first, then
collect last_gsp_loss from each model and record the mean. Shared-model
branch is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jdbloom
Collaborator Author

jdbloom commented Apr 13, 2026

Second-pass review — commit 84c6203

Verdict: fix lands correctly. Ready to merge.

1. Independent-learning fix verified

rl_code/Main.py lines 516-530:

  • Aggregation runs after all per-robot learn() calls complete (loop on 523-524 finishes before the comprehension on 525-528). No stale-value risk.
  • [m.last_gsp_loss for m in models if getattr(m, "last_gsp_loss", None) is not None] correctly handles the partial-fire case (some robots stepped GSP, others didn't). Mean-of-nonzero-subset is the right semantic — it's a per-tick scalar diagnostic, and averaging only over models that actually computed a loss avoids dragging the mean toward zero with stale/None fills. Agreed.
  • if gsp_losses: guard prevents np.mean([]) RuntimeWarning/nan. Correct.
  • getattr(..., None) fallback matches shared-model branch pattern (line 533). Consistent.
  • float(np.mean(...)) cast is belt-and-suspenders (the HDF5 logger stores arrays as float32 anyway). Harmless.

2. No regression in shared-model branch

Lines 531-535 untouched. Primary diagnostic path intact.

3. Commit scope clean

git show --stat 84c6203: 1 file, rl_code/Main.py, +12/-4. No collateral. No new imports needed — numpy as np already imported at line 28.

Red flags checked

  • loss downstream semantics (line 581 writerow): unchanged. In independent mode loss still holds the last robot's learn() return after the loop — same behavior as before the fix. Not affected.
  • last_gsp_loss read timing: reads happen immediately after the learn loop, before any episode-end hidden-state reset (line 590+). Safe.
  • models definedness: the args.independent_learning branch is the only caller of models[...], and models is populated unconditionally in that mode upstream. No new branch risk introduced.
  • No test changes: expected — integration path needs ARGoS.

Cardinality asymmetry from first-pass review is resolved. Ship it.

@jdbloom jdbloom merged commit 57ada5f into feat/learn-every-n-steps Apr 13, 2026
@jdbloom jdbloom deleted the feature/hdf5-gsp-diagnostics branch April 13, 2026 00:32