Features
LLM RL quantization (#522): Adds bitsandbytes quantization to the LLM RL post-training stack plus the memory machinery to run longer-context RL on a single smaller GPU:
- Trainer-side bnb quantization (none | int8 | nf4 QLoRA), resolved from a QUANTIZATION preset by create_population; vLLM mirrors the trainer's precision (bitsandbytes rollout when quantized, dense bf16 otherwise).
- Colocated vLLM rollout: vLLM and trainer each hold their own base and share the GPU via vLLM native sleep/wake; trainer base is CPU-offloaded during rollout and only LoRA adapters are synced per cycle. CUDA-safe trainer-first init.
- Always-on, memory-bounded fused/chunked linear log-probs, plus optional padding-free sequence packing (FA2-varlen / flex-attention block-sparse).
- Fused multi-adapter LoRA forward (actor+critic in one pass) with per-row routing.
- Importance-sampling level (token / turn / trajectory) decoupled from advantage granularity across GRPO / GSPO / CISPO / PPO / REINFORCE, plus a vLLM sampling-mismatch (truncated-IS) correction.
- CI: gpu/vllm-marked tests now run in a CUDA container; bitsandbytes is pinned as a Linux-only dependency.
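To illustrate the importance-sampling level and the truncated-IS correction listed above, here is a minimal sketch in plain Python. It is not the library's implementation: the function names `is_ratios` and `truncated_is`, the `turn_ids` argument, and the `cap` value are hypothetical, and the turn/trajectory ratios are assumed to be length-normalised (the exponential of the mean log-ratio, as in GSPO-style sequence ratios).

```python
import math
from collections import defaultdict


def is_ratios(logp_new, logp_old, turn_ids, level="token"):
    """Return one importance ratio per token, aggregated at `level`.

    Assumption: turn and trajectory ratios are exp(mean log-ratio) over
    the group, then broadcast back to every token in that group.
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    if level == "token":
        groups = list(range(len(diffs)))  # every token is its own group
    elif level == "turn":
        groups = list(turn_ids)           # tokens grouped by turn index
    elif level == "trajectory":
        groups = [0] * len(diffs)         # one group for the whole rollout
    else:
        raise ValueError(f"unknown level: {level!r}")

    sums, counts = defaultdict(float), defaultdict(int)
    for g, d in zip(groups, diffs):
        sums[g] += d
        counts[g] += 1
    return [math.exp(sums[g] / counts[g]) for g in groups]


def truncated_is(logp_trainer, logp_sampler, cap=2.0):
    """Per-token truncated-IS weight min(p_trainer / p_sampler, cap).

    Intended to correct for the rollout engine sampling from slightly
    different probabilities than the trainer computes.
    """
    return [min(math.exp(t - s), cap) for t, s in zip(logp_trainer, logp_sampler)]
```

Because the ratios are returned per token at every level, the advantage can be assigned at a different granularity than the ratio, which is the decoupling the feature describes.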
Docs (#523): Lists the previously missing LLM algorithms (CISPO, GSPO, LLM PPO, LLM REINFORCE, SFT) in the README/API tables, fixes the broken GRPO example and the GSPO heading typo, and expands the loss_type explanation.
Bugs
- EvolvableCNN RNG propagation (#546): the rng setter now also seeds mut_kernel_size, so MutableKernelSizes shares the module's generator instead of an independent RNG, restoring reproducibility of kernel-size mutations.
- PPO value-head save/load (#522): v_head is now restored on the LoRA-only load path and lr_actor is stored, so optimizer-metadata restore no longer crashes.
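The RNG-propagation pattern behind the EvolvableCNN fix can be sketched as follows. This is a simplified stand-in, not the library's code: `KernelSizeMutator` and `TinyEvolvableModule` are hypothetical classes, and the standard-library `random.Random` is assumed in place of whatever generator type the library actually uses.

```python
import random


class KernelSizeMutator:
    """Hypothetical helper that mutates a kernel size using its own rng."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()

    def mutate(self, size):
        # Step the kernel size by +/-2, never going below 1.
        return max(1, size + self.rng.choice([-2, 2]))


class TinyEvolvableModule:
    """Hypothetical module whose rng setter forwards to its helper."""

    def __init__(self):
        self.mut_kernel_size = KernelSizeMutator()
        self._rng = random.Random()

    @property
    def rng(self):
        return self._rng

    @rng.setter
    def rng(self, value):
        self._rng = value
        # The propagation step: without this line the helper keeps an
        # independent generator and mutations ignore the module's seed.
        self.mut_kernel_size.rng = value
```

With the forwarding line in place, two modules assigned generators with the same seed produce the same sequence of kernel-size mutations.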
Dependency upgrades
- tensordict 0.12.2 → 0.13.0 (#515, #526)
- redis 4.4.4 → 8.0.0 (#527)
- pymunk 6.2.1 → 7.2.0 (#518)
- termcolor 1.1.0 → 3.3.0 (#542)
- pre-commit 3.8.0 → 4.6.0 (#543)
- hydra-core 1.3.2 → 1.3.3 (#537)
- omegaconf 2.3.0 → 2.3.1 (#536, #552)
- tqdm 4.67.3 → 4.68.0 (#525)
- dill 0.4.0 → 0.4.1 (#551)
What's Changed
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #508
- Bump tensordict from 0.12.2 to 0.12.3 by @dependabot[bot] in #515
- Bump pymunk from 6.2.1 to 7.2.0 by @dependabot[bot] in #518
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #528
- Bump redis from 4.4.4 to 8.0.0 by @dependabot[bot] in #527
- Bump tqdm from 4.67.3 to 4.68.0 by @dependabot[bot] in #525
- Bump hydra-core from 1.3.2 to 1.3.3 by @dependabot[bot] in #537
- Bump omegaconf from 2.3.0 to 2.3.1 by @dependabot[bot] in #536
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #539
- Bump tensordict from 0.12.3 to 0.13.0 by @dependabot[bot] in #526
- Bump pre-commit from 3.8.0 to 4.6.0 by @dependabot[bot] in #543
- Bump termcolor from 1.1.0 to 3.3.0 by @dependabot[bot] in #542
- LLM RL quantization: bnb QLoRA trainer + colocated vLLM, bounded fused log-probs by @micdoh in #522
- docs: list missing LLM algos and fix GRPO/CISPO/GSPO docs by @micdoh in #523
- Set RNG for MutableKernelSizes too in EvolvableCNN by @jaimesabalbermudez in #546
- Bump omegaconf from 2.3.0 to 2.3.1 by @dependabot[bot] in #552
- Bump dill from 0.4.0 to 0.4.1 by @dependabot[bot] in #551
- refactor: @hide_init_params decorator for GSPO/CISPO init signatures by @micdoh in #540
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #560
- LLM RL Quantization: BnB QLoRA Trainer + Colocated vLLM, Bounded Fused Log-Probs by @jaimesabalbermudez in #555
- Bump 2.7.1 by @jaimesabalbermudez in #561
Full Changelog: v2.7.0...v2.7.1