Release v2.7.1: LLM RL Quantization & Bug Fixes · AgileRL/AgileRL

Features

LLM RL quantization (#522): Adds bitsandbytes quantization to the LLM RL post-training stack plus the memory machinery to run longer-context RL on a single smaller GPU:

Trainer-side bnb quantization (none | int8 | nf4 QLoRA), resolved from a QUANTIZATION preset by create_population; vLLM mirrors the trainer's precision (bitsandbytes rollout when quantized, dense bf16 otherwise).
Colocated vLLM rollout: vLLM and trainer each hold their own base and share the GPU via vLLM native sleep/wake; trainer base is CPU-offloaded during rollout and only LoRA adapters are synced per cycle.
CUDA-safe trainer-first init.
Always-on, memory-bounded fused/chunked linear log-probs, plus optional padding-free sequence packing (FA2-varlen / flex-attention block-sparse).
Fused multi-adapter LoRA forward (actor+critic in one pass) with per-row routing.
Importance-sampling level (token / turn / trajectory) decoupled from advantage granularity across GRPO / GSPO / CISPO / PPO / REINFORCE, plus a vLLM sampling-mismatch (truncated-IS) correction.
CI: gpu/vllm-marked tests now run in a CUDA container; bitsandbytes pinned linux-only.
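
To make the truncated-IS correction and the importance-sampling levels mentioned above more concrete, here is a minimal pure-Python sketch. It is not AgileRL's implementation: the function name `truncated_is_weights`, the `cap` parameter and the `turn_ids` argument are hypothetical, and it assumes the common formulation in which the ratio between trainer and rollout probabilities is clipped from above.

```python
import math


def truncated_is_weights(trainer_logps, rollout_logps, level="token",
                         cap=2.0, turn_ids=None):
    """Return one truncated importance weight per token: min(ratio, cap).

    trainer_logps / rollout_logps hold the log-probability of each sampled
    token under the trainer policy and under the rollout engine.
    All names here are illustrative, not part of any library API.
    """
    if len(trainer_logps) != len(rollout_logps):
        raise ValueError("log-prob sequences must have the same length")
    diffs = [t - r for t, r in zip(trainer_logps, rollout_logps)]

    if level == "token":
        # Each token gets its own ratio.
        log_ratios = diffs
    elif level == "turn":
        # Tokens in the same turn share the summed log-ratio of that turn.
        if turn_ids is None or len(turn_ids) != len(diffs):
            raise ValueError("turn level needs one turn id per token")
        per_turn = {}
        for turn, diff in zip(turn_ids, diffs):
            per_turn[turn] = per_turn.get(turn, 0.0) + diff
        log_ratios = [per_turn[turn] for turn in turn_ids]
    elif level == "trajectory":
        # Every token shares the summed log-ratio of the whole sequence.
        total = sum(diffs)
        log_ratios = [total] * len(diffs)
    else:
        raise ValueError(f"unknown level: {level!r}")

    # Truncation: clip the ratio from above so a large mismatch between
    # rollout and trainer probabilities cannot dominate the gradient.
    return [min(math.exp(lr), cap) for lr in log_ratios]
```

Because the level only changes how log-ratios are aggregated, the same weights can be combined with advantages computed at a different granularity, which is the decoupling the release note describes.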

Docs (#523): list previously missing LLM algos (CISPO, GSPO, LLM PPO, LLM REINFORCE, SFT) in the README/API tables, fix the broken GRPO example and the GSPO heading typo, and expand the loss_type explanation.

Bugs

EvolvableCNN RNG propagation (#546): the rng setter now also seeds mut_kernel_size, so MutableKernelSizes shares the module's generator instead of an independent RNG, restoring reproducibility of kernel-size mutations.
PPO value-head save/load (#522): v_head is now restored on the LoRA-only load path and lr_actor is stored, so optimizer-metadata restore no longer crashes.
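
The EvolvableCNN fix above follows a general pattern: when a parent object owns a random generator, its setter has to forward the new generator to any child that draws random numbers, otherwise the child keeps sampling from its own unseeded stream. The sketch below illustrates that pattern with the standard-library `random.Random`; the classes `KernelSizeMutator` and `TinyModule` are hypothetical stand-ins, not AgileRL's `MutableKernelSizes` or `EvolvableCNN`, and the real code may use a different generator type.

```python
import random


class KernelSizeMutator:
    """Hypothetical child object that randomly changes one kernel size."""

    def __init__(self, sizes, rng=None):
        self.sizes = list(sizes)
        self.rng = rng if rng is not None else random.Random()

    def mutate(self):
        # Pick a layer and a new kernel size using the shared generator.
        idx = self.rng.randrange(len(self.sizes))
        self.sizes[idx] = self.rng.choice([3, 5, 7])
        return list(self.sizes)


class TinyModule:
    """Hypothetical parent whose rng setter propagates to the child."""

    def __init__(self, sizes):
        self._rng = random.Random()
        self.mut_kernel_size = KernelSizeMutator(sizes, self._rng)

    @property
    def rng(self):
        return self._rng

    @rng.setter
    def rng(self, value):
        self._rng = value
        # Without this line the child would keep its old generator and
        # mutations would not be reproducible after reseeding the parent.
        self.mut_kernel_size.rng = value
```

With the propagation in place, two modules given generators with the same seed produce the same sequence of kernel-size mutations.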

Dependency upgrades

tensordict 0.12.2 → 0.13.0 (#515, #526)
redis 4.4.4 → 8.0.0 (#527)
pymunk 6.2.1 → 7.2.0 (#518)
termcolor 1.1.0 → 3.3.0 (#542)
pre-commit 3.8.0 → 4.6.0 (#543)
hydra-core 1.3.2 → 1.3.3 (#537)
omegaconf 2.3.0 → 2.3.1 (#536, #552)
tqdm 4.67.3 → 4.68.0 (#525)
dill 0.4.0 → 0.4.1 (#551)

What's Changed

[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #508
Bump tensordict from 0.12.2 to 0.12.3 by @dependabot[bot] in #515
Bump pymunk from 6.2.1 to 7.2.0 by @dependabot[bot] in #518
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #528
Bump redis from 4.4.4 to 8.0.0 by @dependabot[bot] in #527
Bump tqdm from 4.67.3 to 4.68.0 by @dependabot[bot] in #525
Bump hydra-core from 1.3.2 to 1.3.3 by @dependabot[bot] in #537
Bump omegaconf from 2.3.0 to 2.3.1 by @dependabot[bot] in #536
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #539
Bump tensordict from 0.12.3 to 0.13.0 by @dependabot[bot] in #526
Bump pre-commit from 3.8.0 to 4.6.0 by @dependabot[bot] in #543
Bump termcolor from 1.1.0 to 3.3.0 by @dependabot[bot] in #542
LLM RL quantization: bnb QLoRA trainer + colocated vLLM, bounded fused log-probs by @micdoh in #522
docs: list missing LLM algos and fix GRPO/CISPO/GSPO docs by @micdoh in #523
Set RNG for MutableKernelSizes too in EvolvableCNN by @jaimesabalbermudez in #546
Bump omegaconf from 2.3.0 to 2.3.1 by @dependabot[bot] in #552
Bump dill from 0.4.0 to 0.4.1 by @dependabot[bot] in #551
refactor: @hide_init_params decorator for GSPO/CISPO init signatures by @micdoh in #540
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #560
LLM RL Quantization: BnB QLoRA Trainer + Colocated vLLM, Bounded Fused Log-Probs by @jaimesabalbermudez in #555
Bump 2.7.1 by @jaimesabalbermudez in #561

Full Changelog: v2.7.0...v2.7.1
