diff --git a/README.md b/README.md
index 10b7a80..d0b81ac 100644
--- a/README.md
+++ b/README.md
@@ -93,6 +93,7 @@ IgnitionRL
 
 To author an environment from a blank TypeScript project, follow the first guide in [`docs/BUILD_YOUR_FIRST_ENVIRONMENT.md`](docs/BUILD_YOUR_FIRST_ENVIRONMENT.md).
 To turn a stored learner checkpoint into an inference run and replay, follow [`docs/EXPORT_AND_REPLAY_TRAINED_POLICY.md`](docs/EXPORT_AND_REPLAY_TRAINED_POLICY.md).
+To debug reward shaping with named terms and replay frames, follow [`docs/REWARD_DEBUGGING_GUIDE.md`](docs/REWARD_DEBUGGING_GUIDE.md).
 
 After cloning and installing dependencies, generate a local project with traces, metrics and JSON exports:
 
diff --git a/docs/REWARD_DEBUGGING_GUIDE.md b/docs/REWARD_DEBUGGING_GUIDE.md
new file mode 100644
index 0000000..a76e07e
--- /dev/null
+++ b/docs/REWARD_DEBUGGING_GUIDE.md
@@ -0,0 +1,212 @@
+# Reward Debugging Guide
+
+Reward debugging answers one question: did the agent receive the right signal
+for the behavior you wanted?
+
+IgnitionRL records named reward terms in every trace. The CLI can inspect those
+terms directly today, and the Studio shell reads the same exported JSON
+artifacts for replay and reward panels.
+
+This guide uses `DroneTarget-v0` because it has a useful mix of shaping,
+success, safety and time-cost terms:
+
+- `progress`;
+- `target_reached`;
+- `collision`;
+- `out_of_bounds`;
+- `step_penalty`.
+
+## 1. Create a Project With Failed and Improved Runs
+
+Run the current DroneTarget demo:
+
+```sh
+bun run --cwd packages/cli start demo drone-target ./drone-target-demo.ignitionrl \
+  --seed 42 \
+  --random-episodes 2 \
+  --heuristic-episodes 2 \
+  --learner-episodes 2 \
+  --inference-episodes 2 \
+  --max-steps 12 \
+  --json
+```
+
+The demo creates:
+
+- `drone-target-random`;
+- `drone-target-heuristic`;
+- `drone-target-linear-policy-search`;
+- `drone-target-linear-policy-search-inference`.
+
+Use `compare` to see which run is better:
+
+```sh
+bun run --cwd packages/cli start compare ./drone-target-demo.ignitionrl \
+  --score-by summary.bestReward \
+  --json
+```
+
+For reward debugging, start with a bad run and a better run. In this demo,
+`drone-target-random` is the failed baseline and
+`drone-target-linear-policy-search` is the improved trained run.
+
+## 2. Inspect Reward Terms
+
+Inspect the failed run:
+
+```sh
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-random \
+  --step 0 \
+  --export \
+  --json
+```
+
+Inspect the improved run:
+
+```sh
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --step 0 \
+  --export \
+  --json
+```
+
+The payload includes:
+
+- `termNames`: all reward terms found in the selected trace;
+- `terms`: per-term total, min, max, active-step count and last value;
+- `timeline`: every step with total reward, cumulative reward and term values;
+- `selectedStep`: the requested step;
+- `artifact`: the exported reward-debugger JSON path when `--export` is used.
+
+A healthy improved DroneTarget run usually shows positive `progress` over many
+steps, a single `target_reached` bonus and only the expected `step_penalty`.
+A failed random run often shows negative `progress` and no `target_reached`
+bonus.
+
+## 3. Pair Rewards With Replay Frames
+
+Reward terms explain the numeric signal. Replay frames explain what the agent
+was seeing and doing when it received that signal.
+
+Open the first frame of the failed run:
+
+```sh
+bun run --cwd packages/cli start replay ./drone-target-demo.ignitionrl \
+  drone-target-random \
+  --frame 0 \
+  --json
+```
+
+Open the final frame of the improved run:
+
+```sh
+bun run --cwd packages/cli start replay ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --frame 11 \
+  --json
+```
+
+Look at these fields together:
+
+- `selectedFrame.observation`;
+- `selectedFrame.action`;
+- `selectedFrame.rewardTerms`;
+- `selectedFrame.reason`;
+- `actionDistribution`;
+- `observationDimensions`.
+
+The pattern to look for is simple:
+
+- if `progress` is negative, inspect whether the action moved away from the target;
+- if `target_reached` is active, inspect whether the selected frame ended near the target;
+- if `collision` or `out_of_bounds` is active, inspect the terminal frame and done reason;
+- if `step_penalty` dominates, inspect whether the agent is taking too many steps without progress.
+
+## 4. Compare the Failed and Improved Runs
+
+Use `compare` for the summary view:
+
+```sh
+bun run --cwd packages/cli start compare ./drone-target-demo.ignitionrl \
+  --score-by summary.bestReward \
+  --json
+```
+
+Then use `rewards` for attribution:
+
+```sh
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-random \
+  --json
+
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --json
+```
+
+Good comparisons usually identify one of these:
+
+- the improved run has higher cumulative `progress`;
+- the improved run reaches `target_reached` while the failed run never does;
+- the failed run spends reward budget on `collision`, `out_of_bounds` or long step penalties;
+- the action distribution changed from random-looking noise to a purposeful control pattern.
+
+## 5. Common Reward Bugs
+
+| Symptom | Likely bug | What to inspect |
+| --- | --- | --- |
+| Agent moves away from the target but reward is positive. | `progress` sign is inverted. | Compare `selectedFrame.observation` target-relative dimensions with `selectedFrame.rewardTerms.progress`. |
+| Agent reaches the target but gets no bonus. | Target radius, success condition or `target_reached` condition does not match. | Inspect the terminal replay frame, `reason` and `target_reached.activeSteps`. |
+| Training prefers crashing or leaving bounds. | Penalty is missing, too small or inactive. | Check `collision`, `out_of_bounds`, done `reason` and per-term totals. |
+| Reward is dominated by time cost. | `step_penalty` is too large or progress reward is too weak. | Compare `step_penalty.total` with `progress.total` and episode `length`. |
+| Reward terms appear only as one scalar. | Environment returned a raw scalar or unnamed aggregate instead of named terms. | Return `reward().add("term", value)` terms from the environment. |
+| A term is always zero. | The condition is never true or the wrong state is used. | Inspect `activeSteps`, `lastValue` and whether the term uses `state` vs `nextState`. |
+| Run succeeds in training but fails in inference. | Checkpoint policy, exploration settings or seed distribution differ. | Compare training run replay with checkpoint inference replay and reward terms. |
+
+## 6. CLI Flow to Studio Flow
+
+The current CLI commands map directly to Studio panels:
+
+| CLI command | Current artifact | Future Studio panel |
+| --- | --- | --- |
+| `compare` | project report and run history rows | Experiment history |
+| `replay` | `episode-replay` JSON | Replay timeline and frame inspector |
+| `rewards --export` | `reward-debugger` JSON | Reward attribution panel |
+| `run --export` | `studio-run-view` JSON | Selected run detail |
+| `studio --export` | `studio-workspace-view` JSON | Workspace bootstrap |
+
+Refresh the workspace after exporting reward debugger payloads:
+
+```sh
+bun run --cwd packages/cli start studio ./drone-target-demo.ignitionrl \
+  --run-id drone-target-linear-policy-search \
+  --score-by summary.bestReward \
+  --export \
+  --json
+```
+
+Refresh the selected run detail:
+
+```sh
+bun run --cwd packages/cli start run ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --export \
+  --json
+```
+
+The Studio should not recompute rewards to explain a run. It should read the
+same trace, replay and reward-debugger payloads produced by the CLI.
+
+## 7. Reward Authoring Checklist
+
+Before trusting a learner result, check that:
+
+- every reward cause has a stable name;
+- success bonus and done success condition use the same threshold;
+- penalties use `nextState` when they depend on the result of the action;
+- shaping terms are positive for desired movement and negative for undesired movement;
+- step penalty is large enough to discourage wandering but not larger than useful progress;
+- terminal penalties and bonuses are large enough to dominate incidental shaping;
+- replay frames explain the sign and magnitude of the reward terms.
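Three of the checklist items above (a shared success threshold, outcome-dependent terms reading `nextState`, and the sign of shaping terms) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `DroneState` shape, the `TARGET_RADIUS` and `STEP_PENALTY` values, and the function names are hypothetical and are not taken from `DroneTarget-v0`. It also returns a plain object rather than using the `reward().add(...)` builder, whose exact signature is not shown in this guide.

```typescript
// Hypothetical types for illustration only; not IgnitionRL's actual API.
type Vec2 = { x: number; y: number };
type DroneState = { position: Vec2; target: Vec2 };
type RewardTerms = {
  progress: number;
  target_reached: number;
  step_penalty: number;
};

// Assumed values. The point is that ONE radius feeds both the bonus
// and the success check, so the two can never disagree.
const TARGET_RADIUS = 0.5;
const STEP_PENALTY = -0.01;

function distanceToTarget(s: DroneState): number {
  return Math.hypot(s.target.x - s.position.x, s.target.y - s.position.y);
}

function reachedTarget(s: DroneState): boolean {
  return distanceToTarget(s) <= TARGET_RADIUS;
}

// Named terms computed from the transition (state -> nextState).
function rewardTerms(state: DroneState, nextState: DroneState): RewardTerms {
  return {
    // Positive when the drone got closer, negative when it moved away.
    progress: distanceToTarget(state) - distanceToTarget(nextState),
    // Depends on the result of the action, so it reads nextState.
    target_reached: reachedTarget(nextState) ? 1 : 0,
    step_penalty: STEP_PENALTY,
  };
}

// The done/success condition reuses the same predicate as the bonus.
function isSuccess(nextState: DroneState): boolean {
  return reachedTarget(nextState);
}

const target: Vec2 = { x: 3, y: 4 };
const start: DroneState = { position: { x: 0, y: 0 }, target };
const toward: DroneState = { position: { x: 3, y: 0 }, target };
const away: DroneState = { position: { x: -3, y: 0 }, target };
const arrived: DroneState = { position: { x: 3, y: 4 }, target };

console.log(rewardTerms(start, toward)); // progress is positive
console.log(rewardTerms(start, away)); // progress is negative
console.log(rewardTerms(toward, arrived), isSuccess(arrived));
```

If `progress` were computed as next distance minus previous distance, the first two cases would swap sign, which is the inverted-sign symptom listed in the Common Reward Bugs table.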