SkyRL: v0.2.0

@SumanthRH SumanthRH released this 23 Apr 00:11

Highlights

VLM Support: SkyRL now supports VLM training, both through the Tinker API and via the Python entrypoint. We've validated stable training on single-turn and multi-turn datasets, with both text-only environments and multi-modal environment outputs. Get started here: https://docs.skyrl.ai/docs/tutorials/vision_language_rl

New inference refactor centralizing on HTTP: We've implemented a new HTTP-based inference path for vLLM (in inference_servers). This standardizes all inference interactions over HTTP and integrates vllm-router as a high-performance router for generation requests. We also support prefill-decode disaggregation, which allows users to squeeze out more performance in multi-turn async RL use cases. The new inference codepath is now the default; to use the legacy codepath (inference_engines/), set _SKYRL_USE_NEW_INFERENCE=0. The inference_engines/ codepath will be removed in the next release.
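As a minimal sketch of opting back into the legacy codepath from a Python launcher: the variable name and the value 0 come from the note above, while the helper `use_new_inference` is hypothetical and only illustrates how such a flag might be interpreted. It is also an assumption that the variable must be set before SkyRL is imported or its workers are started.

```python
import os
from typing import Mapping

# Assumption: set the flag before importing or launching SkyRL so that
# every process spawned afterwards inherits it.
os.environ["_SKYRL_USE_NEW_INFERENCE"] = "0"


def use_new_inference(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper (not part of SkyRL): the new codepath is the
    default, and only an explicit "0" selects the legacy codepath."""
    return env.get("_SKYRL_USE_NEW_INFERENCE", "1") != "0"


print("new inference codepath enabled:", use_new_inference())
```

Exporting the variable in the shell that launches training should be equivalent, as long as it reaches every worker process.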

vLLM Native weight syncing API integration: SkyRL's new inference servers implementation uses vLLM's native weight syncing APIs: https://docs.vllm.ai/en/latest/training/weight_transfer/

Step-wise training improvements: We've made a number of fixes to the step-wise training implementation: addressing correctness issues (#1492), adding support for fully async training (#1536), and adding prefix-aware merging to avoid redundant forward passes (#1532).
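To illustrate the idea behind prefix-aware merging, here is a small conceptual sketch. It is not SkyRL's implementation: the `Step` type, the `merge_steps` function, and the greedy merge rule are all assumptions made for illustration. The premise is that when one step's prompt begins with the full token sequence of the previous step (prompt plus response), the two steps can share a single sequence with a combined loss mask, so the shared prefix is only run through the model once.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical container for one step of a multi-turn trajectory."""
    prompt_ids: list[int]
    response_ids: list[int]


def merge_steps(steps: list[Step]) -> list[tuple[list[int], list[int]]]:
    """Greedily merge consecutive steps whose prompt extends the previous
    merged sequence. Returns (token_ids, loss_mask) pairs, where the mask
    is 1 on response tokens and 0 everywhere else."""
    merged: list[tuple[list[int], list[int]]] = []
    for step in steps:
        if merged:
            tokens, mask = merged[-1]
            n = len(tokens)
            if step.prompt_ids[:n] == tokens:
                # Tokens added to the prompt since the last step
                # (e.g. an environment observation) are not trained on.
                extra = step.prompt_ids[n:]
                merged[-1] = (
                    tokens + extra + step.response_ids,
                    mask + [0] * len(extra) + [1] * len(step.response_ids),
                )
                continue
        # No shared prefix: start a new sequence.
        merged.append((
            step.prompt_ids + step.response_ids,
            [0] * len(step.prompt_ids) + [1] * len(step.response_ids),
        ))
    return merged


steps = [Step([1, 2], [3]), Step([1, 2, 3, 4], [5]), Step([9], [8])]
for tokens, mask in merge_steps(steps):
    print(tokens, mask)
```

In this toy input the first two steps collapse into one sequence, so three per-step forward passes become two.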

R3 support: SkyRL now supports R3 (Rollout Routing Replay) for stabilizing training with MoE models. Currently this is limited to setups where each vLLM engine runs within a single node, due to limitations in vLLM.

Nemotron 3 and Qwen 3.5 support: SkyRL now supports Nemotron 3 and Qwen 3.5 models. Nemotron 3 is supported in the FSDP and Megatron backends, while Qwen 3.5 is supported in the FSDP, Megatron, and JAX backends.

What's Changed

  • [trivial] Fix comments numbering and make some code more concise in trainer.py by @CharlieFRuan in #1283
  • [train][1/N] Native Weight Sync API: NCCL by @hao-aaron in #1271
  • [ci] Fix CI from broken test imports from #1271 by @erictang000 in #1290
  • [fix] Fix cuda ipc weight sync after #1271 by @erictang000 in #1292
  • fix paths for instruction comments to match current location by @linde in #1294
  • [CI] Skip FlashRL integration test in CI and fix failing generation test for new inference codepath by @SumanthRH in #1301
  • Skip output_router_logits for granitemoehybrid models by @eltonjohnfanboy in #1295
  • WIP: Restore PR changes lost during skyrl-train deprecation by @tyler-griggs in #1310
  • [chore] Update skyrl and skyrl-gym versions after 0.1.0 release by @SumanthRH in #1312
  • [lint] Add isort to pre-commit by @SumanthRH in #1267
  • [examples][bug] fix silent eval max generate length not overriding by @erictang000 in #1317
  • [docs] Add explicit eval_sampling_params.max_generate_length by @SumanthRH in #1318
  • [vllm] enable mp distributed executor backend (no multi-node engines) by @erictang000 in #1300
  • [train][2/N] Native Weight Syncing APIs: IPC by @hao-aaron in #1291
  • [algorithm][generator] change overlong filtering to use stop reasons over checking eos token by @erictang000 in #1319
  • Add rollout_is policy loss by @SamuelGabriel in #1314
  • [AsyncRL] Use keep mode for pause and resume by @hao-aaron in #1179
  • [skyrl][inference] Fix port collision when ports are allocated. by @nithinvc in #1302
  • R3 PR: Rollout Routing Replay by @erictang000 in #1273
  • [megatron] enable bucketed weight sync for non-colocated nccl weight sync in megatron by @erictang000 in #1324
  • [fix] Fix placement group bundle ordering for inference engines by @SumanthRH in #1308
  • [train][fix] Fix concurrency limitations in the new inference codepath by @SumanthRH in #1320
  • [megatron][lora] Fix megatron lora weight syncing not initializing buckets correctly by @erictang000 in #1330
  • [train] Add worker_process_setup_hook to set mp start method to spawn by @SumanthRH in #1333
  • [CI] Fix test_inference_engines_generation after vllm 0.16.0 upgrade; Use the correct GSM8k path for test_generator_multi_turn_gsm8k_router_replay by @SumanthRH in #1339
  • [train] Make TrainingInputBatch to PAD only to left, hence response tensors be right-aligned by @CharlieFRuan in #1285
  • Revert "[train] Add worker_process_setup_hook to set mp start method to spawn" by @SumanthRH in #1344
  • [Docs] Add docs on agent integration and step-wise training by @CharlieFRuan in #1347
  • [Docs] Small update on docs by @CharlieFRuan in #1348
  • [train] Add validation for step-wise GeneratorOutput by @CharlieFRuan in #1281
  • [megatron] rebuild weight conversion tasks per sync to prevent stale PP-collective caches with bucketing by @erictang000 in #1345
  • [StepWise] Trivial fix to avg_response_length metric by @CharlieFRuan in #1351
  • [CI] Make MultiItemDataset a global variable after switch to spawn by @SumanthRH in #1346
  • [train] Add support for LoRA in the new inference codepath by @SumanthRH in #1329
  • [bug][algorithm] remove incorrect torch.no_grad() for kl in loss (use_kl_loss=True) by @erictang000 in #1353
  • [transformers] set return dict false for transformers v5 compatibility by @erictang000 in #1325
  • [skyrl][tx] Move ModelInput token extraction to backends by @nithinvc in #1352
  • [tx] Fuse the projection matrices for Qwen3 by @pcmoritz in #1341
  • [tx] Fuse the projection matrices for Qwen 3.5 by @pcmoritz in #1362
  • Add CodeScout project to README by @CharlieFRuan in #1364
  • [train] Patch vLLM v0.16.0 sleep mode to properly free model weights by @CharlieFRuan in #1365
  • [tx] Optimize the decode performance by @pcmoritz in #1363
  • [skyrl] Add ImageChunk and ImageAssetPointerChunk types by @nithinvc in #1361
  • [train] Enable support for the mp backend with the new inference codepath by @SumanthRH in #1355
  • Fix loss_fn_outputs right-aligned slicing in Tinker API path by @CharlieFRuan in #1367
  • [bug] Move server creation and server start in the same thread by @hao-aaron in #1375
  • [router replay] downcast expert router indices to uint8/int16 to reduce space by @erictang000 in #1378
  • [train] Fix double-serialization of TensorBatch in pickle by @erictang000 in #1379
  • Bump vLLM to 0.18 by @hao-aaron in #1374
  • [train] Use a shared semaphore for all generate requests with RemoteInferenceClient; Move tokenization to client by @SumanthRH in #1381
  • [async] Add search r1 fully async script by @CharlieFRuan in #1386
  • [train] Add vLLMRouter in new inference codepath by @SumanthRH in #1385
  • [trainer] refactor dispatch_from_staged to individually serialize DP chunks to avoid materializing whole batch on all workers by @erictang000 in #1376
  • [SkyRL] Introduce /render endpoint to the new http inference client by @nithinvc in #1373
  • [async] Add DAPO fully async script by @CharlieFRuan in #1390
  • [examples] update command paths to include train/ in examples by @erictang000 in #1395
  • [bug] fix sleep bugs by @hao-aaron in #1383
  • [train][2/N] Support for Megatron PP + CP for R3 by @devpatelio in #1335
  • [train] Multi-modal inputs support in FSDP2 by @nithinvc in #1331
  • [bug] Fix weight sync with DP > 1 in non-colocated setups by @SumanthRH in #1399
  • [train] Make vLLMRouter the default router in the new inference codepath by @SumanthRH in #1394
  • [bug] Fix KeyError for pixel_values in RefWorkerBase.forward by @SumanthRH in #1402
  • [megatron][checkpointing] fix checkpointing with optimizer cpu offload for dist_ckpt_optim_fully_reshardable=False by @erictang000 in #1403
  • Add MaxRL mean normalization over advantages by @tamoghnokandar in #1126
  • [ci] bump logprob diff gap from 7e-2 to 9e-2 for dense models by @erictang000 in #1408
  • [example] Add example scripts for Nemotron models by @SumanthRH in #1409
  • [skyrl] Add /sample endpoint to RemoteInferenceClient following Tinker API by @nithinvc in #1396
  • [bug] Fix fork + Ray deadlock in dataset filtering by using spawn by @SumanthRH in #1415
  • [CI] Add ep tests to CI by @hao-aaron in #1404
  • Fix preprocess_packed_seqs crash with short sequences under CP > 1 by @CharlieFRuan in #1407
  • [fix] Forward SKYRL_RAY_PG_TIMEOUT_IN_S to workers via runtime_env by @CharlieFRuan in #1406
  • [BREAKING][skyrl-train] Implement loss reduction via advantage normalization and fix token_mean reduction strategy by @justinvyu in #1296
  • [megatron] Fix loss aggregation for context parallelism (CP) in Megatron by @erictang000 in #1420
  • [megatron] Upgrade mbridge -> 0.3.1, megatron-core -> 0.16.1 by @erictang000 in #1412
  • [bug] Fix logprobs handling and error status codes for vllm-router compatibility by @SumanthRH in #1421
  • [megatron] Fix redundant downloading of shards across workers/nodes for megatron dist checkpointing by @erictang000 in #1414
  • fix: correct gsm8k example paths in Modal README by @tyfeng1997 in #1429
  • [test] Refactor tests to use single event loop; fix RemoteInferenceClient connector cleanup by @SumanthRH in #1387
  • [Bug][CI] Increase server timeout for MoE model initialization by @SumanthRH in #1454
  • [tinker] Implement model unloading for skyrl_train_backend by @pcmoritz in #1453
  • [dependencies] Upgrade transformers to >=5.0.0,<=5.3.0 by @erictang000 in #1426
  • [Fix][CI] Fix CI Failures: set return_dict=False for sample API tests, use proper sleep/wake up for tests with colocated engine by @SumanthRH in #1463
  • [dependencies] bump vllm to 0.19.0 by @erictang000 in #1462
  • Add OpenReward environment integration example by @tyfeng1997 in #1458
  • [ci] Add ability to trigger skyrl-train gpu CI (and megatron CI) by adding run_train_gpu_ci and run_train_megatron_gpu_ci labels by @erictang000 in #1465
  • [megatron] support qwen3.5 models for megatron, bump mbridge + megatron-core to latest by @erictang000 in #1425
  • fix: use gloo for CPU collectives in Megatron workers to fix multi-node checkpointing by @CharlieFRuan in #1466
  • [tinker] Forbid extra keys in EngineConfig by @pbokc in #1459
  • fix: set CUDA device before gloo+nccl process group init in Megatron workers by @CharlieFRuan in #1468
  • [megatron] Fix MoE weight syncing by grouping fused expert tasks into dedicated buckets, supporting qwen3.5 moe by @erictang000 in #1471
  • [Fix] Add semaphore throttling to RemoteInferenceClient.sample, RemoteInferenceClient.completion, RemoteInferenceClient.chat_completion by @SumanthRH in #1475
  • [SkyRL][train] Support prompt_logprobs in /sample in the new inference stack by @nithinvc in #1417
  • fix: escape bare < in generated MDX to fix Vercel doc builds by @tyler-griggs in #1478
  • [R3] Enable R3 with new inference by @hao-aaron in #1428
  • [skyrl] New inference client in the skyrl-train tinker backend by @nithinvc in #1452
  • [fix] update vllm.inputs.data import to vllm.inputs after vllm 0.19.0 upgrade by @SumanthRH in #1482
  • [skyrl] vLLM Renderer for rendering Multi-Modal ModelInputChunks for training backend by @nithinvc in #1464
  • [qoc] Remove thread for sleep() in _get_new_inference_client by @CharlieFRuan in #1488
  • [train] Save checkpoints at epoch boundaries by @CharlieFRuan in #1490
  • [train] Final ckpt for fully async by @CharlieFRuan in #1491
  • [CI] Fix moe model initialization timeout by @SumanthRH in #1495
  • [feat][train] Add prefill-decode disaggregation support and refactor vLLMRouter by @SumanthRH in #1467
  • [CI] Relax Megatron vs FSDP policy_loss tolerance in test_megatron_train by @SumanthRH in #1500
  • [train][multimodal][1/3] Add vision support to generate() in new inference stack by @nithinvc in #1494
  • [multimodal] add language_model_only flag for models like qwen3.5 by @erictang000 in #1487
  • [fix][train][step-wise] Broadcast step-wise advantage with each step's own response_mask by @CharlieFRuan in #1507
  • [fix][CI] Fix cleanup for test_weight_sync by @SumanthRH in #1510
  • [fix][train] Fix port collision for num_engines > 1 in non-PD case by @SumanthRH in #1511
  • [train][multimodal][3/3] Trainer changes to extract multi-modal outputs from GeneratorOutput by @nithinvc in #1498
  • [CI] Migrate non-Megatron GPU CI to run on new inference codepath by @SumanthRH in #1476
  • [feat] chunked ipc support for new inference by @hao-aaron in #1512
  • [CI] Fix Megatron Lora for new inference and migrate Megatron CI to new inference codepath by @SumanthRH in #1518
  • [CI] Migrate E2E CI to use new inference by @SumanthRH in #1521
  • [train] SFT 1/N: Add a native SFT trainer to SkyRL by @SumanthRH in #1503
  • [dep] Add causal-conv1d and mamba-ssm wheels, add mamba-ssm as extra by @CharlieFRuan in #1524
  • [qoc] Remove unused TrainingInputBatch.metadata["trajectory_ids"] by @CharlieFRuan in #1526
  • [qoc] Extract pad_batch() into a helper to training_batch.py by @CharlieFRuan in #1527
  • [bug][train] Fix max_seq_len calculation by @tamoghnokandar in #1303
  • [fix][train] Prompt-based mini-batching for step-wise training by @CharlieFRuan in #1529
  • [train][multimodal][3/3] Add multi-turn VLM generator by @nithinvc in #1486
  • [skyrl][tinker] Use VLLMRenderer in SkyRL train backend by @nithinvc in #1496
  • [qoc] Make concatenate_generator_outputs linear instead of O(K^2) by @CharlieFRuan in #1535
  • [CI] Fix timeout failures in E2E CI pipelines by @SumanthRH in #1533
  • [stepwise] Plumb through step-wise training for fully async by @CharlieFRuan in #1536
  • [train][step-wise] Add prefix-aware merging for step-wise training by @CharlieFRuan in #1532
  • Revert "[train][step-wise] Add prefix-aware merging for step-wise training" by @CharlieFRuan in #1537
  • [train][step-wise] Add prefix-aware merging for step-wise training by @CharlieFRuan in #1538
  • [harbor] Bump harbor to HEAD, make run_harbor_gen runnable by @CharlieFRuan in #1540
  • [trivial][test] Add overlong filtering test to step-wise prefix-aware merging by @CharlieFRuan in #1541
  • [docs][example] VLM Examples by @nithinvc in #1531
  • Make the new inference codepath the default by @SumanthRH in #1544
  • [chore] Move flash attn wheel specification to uv.sources by @SumanthRH in #1547
  • [bug][tinker] Fix vLLMRouter init in SkyRLTrainBackend by @SumanthRH in #1554
  • [train] Support packing for CUDA IPC transfer with new inference codepath by @SumanthRH in #1558

Full Changelog: skyrl-v0.1.0...skyrl-v0.2.0