Add a first neural learner path for discrete toy/debug environments.
Acceptance criteria:
- DQN trains against GridWorld or Target2D without users implementing the algorithm;
- metrics include reward, episode length, epsilon/exploration and loss;
- checkpoint save/load supports inference replay;
- a CI smoke test shows improvement over a random policy on a deterministic demo.
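
The criteria above could be exercised by a smoke check along the following lines. This is only a sketch under stated assumptions: the one-dimensional corridor stands in for GridWorld, tabular Q-learning stands in for the DQN learner so the example needs no deep-learning dependency, and a JSON round trip stands in for checkpoint save/load. All names (`step`, `greedy`, `train`, `evaluate`) and hyperparameters are hypothetical and not part of any existing API.

```python
import json
import random

GOAL, MAX_STEPS = 5, 50  # hypothetical corridor: start in cell 0, goal in cell 5


def step(state, action):
    """Deterministic move: action 0 is left, 1 is right; every step costs -1."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, -1.0, nxt == GOAL


def greedy(q, state):
    """Pick the action with the higher Q-value (ties go to action 1)."""
    return 0 if q[state][0] > q[state][1] else 1


def train(episodes=200, alpha=0.5, gamma=0.95, seed=0):
    """Tabular Q-learning (stand-in for DQN) logging four metrics per episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    epsilon, metrics = 1.0, []
    for _ in range(episodes):
        state, total, length, sq_err, done = 0, 0.0, 0, 0.0, False
        while not done and length < MAX_STEPS:
            explore = rng.random() < epsilon
            action = rng.randrange(2) if explore else greedy(q, state)
            nxt, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[nxt])
            td_error = target - q[state][action]
            q[state][action] += alpha * td_error
            sq_err += td_error ** 2
            state, total, length = nxt, total + reward, length + 1
        metrics.append({"reward": total, "length": length,
                        "epsilon": epsilon, "loss": sq_err / length})
        epsilon = max(0.05, epsilon * 0.97)  # decay exploration, keep a floor
    return q, metrics


def evaluate(policy, episodes=100):
    """Mean (return, episode length) of a policy over several episodes."""
    returns, lengths = [], []
    for _ in range(episodes):
        state, total, length, done = 0, 0.0, 0, False
        while not done and length < MAX_STEPS:
            state, reward, done = step(state, policy(state))
            total, length = total + reward, length + 1
        returns.append(total)
        lengths.append(length)
    return sum(returns) / episodes, sum(lengths) / episodes


q, metrics = train()
restored = json.loads(json.dumps(q))  # in-memory stand-in for checkpoint save/load
eval_rng = random.Random(1)           # fixed seed keeps the baseline reproducible
random_return, random_len = evaluate(lambda s: eval_rng.randrange(2))
trained_return, trained_len = evaluate(lambda s: greedy(restored, s))
assert trained_return > random_return, "learner should beat the random baseline"
```

Fixed seeds and a deterministic environment keep the comparison reproducible, which is what makes a check like this usable in CI. A real DQN path would replace the Q-table with a network and the JSON round trip with the framework's own checkpoint format.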