Highlights
VLM Support: SkyRL now supports VLM training, both through the Tinker API and via the Python entrypoint. We've validated stable training for single and multi-turn datasets with both text-only environments and multi-modal environment outputs. Get started here: https://docs.skyrl.ai/docs/tutorials/vision_language_rl
New inference refactor centralizing on HTTP: We've implemented a new HTTP-based inference path for vLLM (in inference_servers). This standardizes all inference interactions over HTTP and integrates vllm-router as a high-performance router for generation requests. We also support prefill-decode disaggregation, which lets users squeeze out more performance in multi-turn async RL use cases. The new inference codepath is now the default; to use the legacy codepath (inference_engines/), set _SKYRL_USE_NEW_INFERENCE=0. The inference_engines/ codepath will be removed in the next release.
vLLM Native weight syncing API integration: SkyRL's new inference servers implementation uses vLLM's native weight syncing APIs: https://docs.vllm.ai/en/latest/training/weight_transfer/
Step-wise training improvements: We've made a number of fixes to the step-wise training implementation: addressing correctness issues (#1492), adding support for fully async training (#1536), and adding prefix-aware merging to avoid redundant forward passes (#1532).
R3 support: SkyRL now supports R3 (Rollout Routing Replay) for stabilizing training with MoE models. Currently this is limited to setups where each vLLM engine runs within a single node, due to limitations in vLLM.
Nemotron 3 and Qwen 3.5 support: SkyRL now supports Nemotron 3 and Qwen 3.5 models. Nemotron 3 is supported in the FSDP and Megatron backends, while Qwen 3.5 is supported in the FSDP, Megatron, and JAX backends.
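For anyone who needs to stay on the legacy inference codepath until it is removed, the `_SKYRL_USE_NEW_INFERENCE=0` switch mentioned above can be set on the environment of whatever process launches training. The sketch below is only an illustration: the helper names `legacy_inference_env` and `run_with_legacy_inference` are hypothetical and not part of SkyRL, and the launch command is left to the caller because these notes do not specify one.

```python
import os
import subprocess

# Environment variable named in the release notes; "0" selects the legacy codepath.
LEGACY_FLAG = "_SKYRL_USE_NEW_INFERENCE"


def legacy_inference_env(base_env=None):
    """Return a copy of the environment with the legacy inference flag set.

    Hypothetical helper: copies os.environ (or the mapping passed in) so the
    caller's environment is never mutated.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[LEGACY_FLAG] = "0"
    return env


def run_with_legacy_inference(cmd):
    """Run the caller's own training command (a list of arguments) with the flag set."""
    return subprocess.run(cmd, env=legacy_inference_env(), check=True)
```

Pass your existing training launch command as `cmd`. Exporting the variable in the shell before launching should have the same effect.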
What's Changed
- [trivial] Fix comments numbering and make some code more concise in trainer.py by @CharlieFRuan in #1283
- [train][1/N] Native Weight Sync API: NCCL by @hao-aaron in #1271
- [ci] Fix CI from broken test imports from #1271 by @erictang000 in #1290
- [fix] Fix cuda ipc weight sync after #1271 by @erictang000 in #1292
- fix paths for instruction comments to match current location by @linde in #1294
- [CI] Skip FlashRL integration test in CI and fix failing generation test for new inference codepath by @SumanthRH in #1301
- Skip output_router_logits for granitemoehybrid models by @eltonjohnfanboy in #1295
- WIP: Restore PR changes lost during skyrl-train deprecation by @tyler-griggs in #1310
- [chore] Update `skyrl` and `skyrl-gym` versions after 0.1.0 release by @SumanthRH in #1312
- [lint] Add isort to pre-commit by @SumanthRH in #1267
- [examples][bug] fix silent eval max generate length not overriding by @erictang000 in #1317
- [docs] Add explicit `eval_sampling_params.max_generate_length` by @SumanthRH in #1318
- [vllm] enable mp distributed executor backend (no multi-node engines) by @erictang000 in #1300
- [train][2/N] Native Weight Syncing APIs: IPC by @hao-aaron in #1291
- [algorithm][generator] change overlong filtering to use stop reasons over checking eos token by @erictang000 in #1319
- Add rollout_is policy loss by @SamuelGabriel in #1314
- [AsyncRL] Use keep mode for pause and resume by @hao-aaron in #1179
- [skyrl][inference] Fix port collision when ports are allocated. by @nithinvc in #1302
- R3 PR: Rollout Routing Replay by @erictang000 in #1273
- [megatron] enable bucketed weight sync for non-colocated nccl weight sync in megatron by @erictang000 in #1324
- [fix] Fix placement group bundle ordering for inference engines by @SumanthRH in #1308
- [train][fix] Fix concurrency limitations in the new inference codepath by @SumanthRH in #1320
- [megatron][lora] Fix megatron lora weight syncing not initializing buckets correctly by @erictang000 in #1330
- [train] Add `worker_process_setup_hook` to set mp start method to `spawn` by @SumanthRH in #1333
- [CI] Fix `test_inference_engines_generation` after vllm 0.16.0 upgrade; Use the correct GSM8k path for `test_generator_multi_turn_gsm8k_router_replay` by @SumanthRH in #1339
- [train] Make TrainingInputBatch to PAD only to left, hence response tensors be right-aligned by @CharlieFRuan in #1285
- Revert "[train] Add `worker_process_setup_hook` to set mp start method to `spawn`" by @SumanthRH in #1344
- [Docs] Add docs on agent integration and step-wise training by @CharlieFRuan in #1347
- [Docs] Small update on docs by @CharlieFRuan in #1348
- [train] Add validation for step-wise GeneratorOutput by @CharlieFRuan in #1281
- [megatron] rebuild weight conversion tasks per sync to prevent stale PP-collective caches with bucketing by @erictang000 in #1345
- [StepWise] Trivial fix to avg_response_length metric by @CharlieFRuan in #1351
- [CI] Make `MultiItemDataset` a global variable after switch to `spawn` by @SumanthRH in #1346
- [train] Add support for LoRA in the new inference codepath by @SumanthRH in #1329
- [bug][algorithm] remove incorrect torch.no_grad() for kl in loss (use_kl_loss=True) by @erictang000 in #1353
- [transformers] set return dict false for transformers v5 compatibility by @erictang000 in #1325
- [skyrl][tx] Move ModelInput token extraction to backends by @nithinvc in #1352
- [tx] Fuse the projection matrices for Qwen3 by @pcmoritz in #1341
- [tx] Fuse the projection matrices for Qwen 3.5 by @pcmoritz in #1362
- Add CodeScout project to README by @CharlieFRuan in #1364
- [train] Patch vLLM v0.16.0 sleep mode to properly free model weights by @CharlieFRuan in #1365
- [tx] Optimize the decode performance by @pcmoritz in #1363
- [skyrl] Add ImageChunk and ImageAssetPointerChunk types by @nithinvc in #1361
- [train] Enable support for the `mp` backend with the new inference codepath by @SumanthRH in #1355
- Fix loss_fn_outputs right-aligned slicing in Tinker API path by @CharlieFRuan in #1367
- [bug] Move server creation and server start in the same thread by @hao-aaron in #1375
- [router replay] downcast expert router indices to uint8/int16 to reduce space by @erictang000 in #1378
- [train] Fix double-serialization of TensorBatch in pickle by @erictang000 in #1379
- Bump vLLM to 0.18 by @hao-aaron in #1374
- [train] Use a shared semaphore for all generate requests with `RemoteInferenceClient`; Move tokenization to client by @SumanthRH in #1381
- [async] Add search r1 fully async script by @CharlieFRuan in #1386
- [train] Add `vLLMRouter` in new inference codepath by @SumanthRH in #1385
- [trainer] refactor `dispatch_from_staged` to individually serialize DP chunks to avoid materializing whole batch on all workers by @erictang000 in #1376
- [SkyRL] Introduce /render endpoint to the new http inference client by @nithinvc in #1373
- [async] Add DAPO fully async script by @CharlieFRuan in #1390
- [examples] update command paths to include `train/` in examples by @erictang000 in #1395
- [bug] fix sleep bugs by @hao-aaron in #1383
- [train][2/N] Support for Megatron PP + CP for R3 by @devpatelio in #1335
- [train] Multi-modal inputs support in FSDP2 by @nithinvc in #1331
- [bug] Fix weight sync with DP > 1 in non-colocated setups by @SumanthRH in #1399
- [train] Make `vLLMRouter` the default router in the new inference codepath by @SumanthRH in #1394
- [bug] Fix `KeyError` for `pixel_values` in `RefWorkerBase.forward` by @SumanthRH in #1402
- [megatron][checkpointing] fix checkpointing with optimizer cpu offload for dist_ckpt_optim_fully_reshardable=False by @erictang000 in #1403
- Add MaxRL mean normalization over advantages by @tamoghnokandar in #1126
- [ci] bump logprob diff gap from 7e-2 to 9e-2 for dense models by @erictang000 in #1408
- [example] Add example scripts for Nemotron models by @SumanthRH in #1409
- [skyrl] Add /sample endpoint to RemoteInferenceClient following Tinker API by @nithinvc in #1396
- [bug] Fix fork + Ray deadlock in dataset filtering by using spawn by @SumanthRH in #1415
- [CI] Add ep tests to CI by @hao-aaron in #1404
- Fix preprocess_packed_seqs crash with short sequences under CP > 1 by @CharlieFRuan in #1407
- [fix] Forward SKYRL_RAY_PG_TIMEOUT_IN_S to workers via runtime_env by @CharlieFRuan in #1406
- [BREAKING][skyrl-train] Implement loss reduction via advantage normalization and fix `token_mean` reduction strategy by @justinvyu in #1296
- [megatron] Fix loss aggregation for context parallelism (CP) in Megatron by @erictang000 in #1420
- [megatron] Upgrade mbridge -> 0.3.1, megatron-core -> 0.16.1 by @erictang000 in #1412
- [bug] Fix logprobs handling and error status codes for vllm-router compatibility by @SumanthRH in #1421
- [megatron] Fix redundant downloading of shards across workers/nodes for megatron dist checkpointing by @erictang000 in #1414
- fix: correct gsm8k example paths in Modal README by @tyfeng1997 in #1429
- [test] Refactor tests to use single event loop; fix RemoteInferenceClient connector cleanup by @SumanthRH in #1387
- [Bug][CI] Increase server timeout for MoE model initialization by @SumanthRH in #1454
- [tinker] Implement model unloading for skyrl_train_backend by @pcmoritz in #1453
- [dependencies] Upgrade transformers to >=5.0.0,<=5.3.0 by @erictang000 in #1426
- [Fix][CI] Fix CI Failures: set `return_dict=False` for sample API tests, use proper sleep/wake up for tests with colocated engine by @SumanthRH in #1463
- [dependencies] bump vllm to 0.19.0 by @erictang000 in #1462
- Add OpenReward environment integration example by @tyfeng1997 in #1458
- [ci] Add ability to trigger skyrl-train gpu CI (and megatron CI) by adding run_train_gpu_ci and run_train_megatron_gpu_ci labels by @erictang000 in #1465
- [megatron] support qwen3.5 models for megatron, bump mbridge + megatron-core to latest by @erictang000 in #1425
- fix: use gloo for CPU collectives in Megatron workers to fix multi-node checkpointing by @CharlieFRuan in #1466
- [tinker] Forbid extra keys in EngineConfig by @pbokc in #1459
- fix: set CUDA device before gloo+nccl process group init in Megatron workers by @CharlieFRuan in #1468
- [megatron] Fix MoE weight syncing by grouping fused expert tasks into dedicated buckets, supporting qwen3.5 moe by @erictang000 in #1471
- [Fix] Add semaphore throttling to `RemoteInferenceClient.sample`, `RemoteInferenceClient.completion`, `RemoteInferenceClient.chat_completion` by @SumanthRH in #1475
- [SkyRL][train] Support prompt_logprobs in /sample in the new inference stack by @nithinvc in #1417
- fix: escape bare < in generated MDX to fix Vercel doc builds by @tyler-griggs in #1478
- [R3] Enable R3 with new inference by @hao-aaron in #1428
- [skyrl] New inference client in the skyrl-train tinker backend by @nithinvc in #1452
- [fix] update `vllm.inputs.data` import to `vllm.inputs` after vllm 0.19.0 upgrade by @SumanthRH in #1482
- [skyrl] vLLM Renderer for rendering Multi-Modal ModelInputChunks for training backend by @nithinvc in #1464
- [qoc] Remove thread for sleep() in _get_new_inference_client by @CharlieFRuan in #1488
- [train] Save checkpoints at epoch boundaries by @CharlieFRuan in #1490
- [train] Final ckpt for fully async by @CharlieFRuan in #1491
- [CI] Fix moe model initialization timeout by @SumanthRH in #1495
- [feat][train] Add prefill-decode disaggregation support and refactor vLLMRouter by @SumanthRH in #1467
- [CI] Relax Megatron vs FSDP policy_loss tolerance in `test_megatron_train` by @SumanthRH in #1500
- [train][multimodal][1/3] Add vision support to generate() in new inference stack by @nithinvc in #1494
- [multimodal] add language_model_only flag for models like qwen3.5 by @erictang000 in #1487
- [fix][train][step-wise] Broadcast step-wise advantage with each step's own response_mask by @CharlieFRuan in #1507
- [fix][CI] Fix cleanup for `test_weight_sync` by @SumanthRH in #1510
- [fix][train] Fix port collision for num_engines > 1 in non-PD case by @SumanthRH in #1511
- [train][multimodal][3/3] Trainer changes to extract multi-modal outputs from GeneratorOutput by @nithinvc in #1498
- [CI] Migrate non-Megatron GPU CI to run on new inference codepath by @SumanthRH in #1476
- [feat] chunked ipc support for new inference by @hao-aaron in #1512
- [CI] Fix Megatron Lora for new inference and migrate Megatron CI to new inference codepath by @SumanthRH in #1518
- [CI] Migrate E2E CI to use new inference by @SumanthRH in #1521
- [train] SFT 1/N: Add a native SFT trainer to SkyRL by @SumanthRH in #1503
- [dep] Add causal-conv1d and mamba-ssm wheels, add mamba-ssm as extra by @CharlieFRuan in #1524
- [qoc] Remove unused TrainingInputBatch.metadata["trajectory_ids"] by @CharlieFRuan in #1526
- [qoc] Extract `pad_batch()` into a helper to `training_batch.py` by @CharlieFRuan in #1527
- [bug][train] Fix max_seq_len calculation by @tamoghnokandar in #1303
- [fix][train] Prompt-based mini-batching for step-wise training by @CharlieFRuan in #1529
- [train][multimodal][3/3] Add multi-turn VLM generator by @nithinvc in #1486
- [skyrl][tinker] Use VLLMRenderer in SkyRL train backend by @nithinvc in #1496
- [qoc] Make concatenate_generator_outputs linear instead of O(K^2) by @CharlieFRuan in #1535
- [CI] Fix timeout failures in E2E CI pipelines by @SumanthRH in #1533
- [stepwise] Plumb through step-wise training for fully async by @CharlieFRuan in #1536
- [train][step-wise] Add prefix-aware merging for step-wise training by @CharlieFRuan in #1532
- Revert "[train][step-wise] Add prefix-aware merging for step-wise training" by @CharlieFRuan in #1537
- [train][step-wise] Add prefix-aware merging for step-wise training by @CharlieFRuan in #1538
- [harbor] Bump harbor to HEAD, make run_harbor_gen runnable by @CharlieFRuan in #1540
- [trivial][test] Add overlong filtering test to step-wise prefix-aware merging by @CharlieFRuan in #1541
- [docs][example] VLM Examples by @nithinvc in #1531
- Make the new inference codepath the default by @SumanthRH in #1544
- [chore] Move flash attn wheel specification to `uv.sources` by @SumanthRH in #1547
- [bug][tinker] Fix vLLMRouter init in SkyRLTrainBackend by @SumanthRH in #1554
- [train] Support packing for CUDA IPC transfer with new inference codepath by @SumanthRH in #1558
New Contributors
- @linde made their first contribution in #1294
- @eltonjohnfanboy made their first contribution in #1295
- @SamuelGabriel made their first contribution in #1314
- @nithinvc made their first contribution in #1302
- @tyfeng1997 made their first contribution in #1429
Full Changelog: skyrl-v0.1.0...skyrl-v0.2.0