[tx][megatron] making megatron skyrl-train worker usable as TX backend#1067
[tx][megatron] making megatron skyrl-train worker usable as TX backend#1067erictang000 merged 7 commits intoNovaSky-AI:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the training backends to support megatron in addition to jax and fsdp. The changes involve updating pyproject.toml with new dependencies and configurations for uv, and modifying the backend selection logic to accommodate megatron.
My review focuses on ensuring clarity and robustness of the new configuration. I've identified a misleading comment in pyproject.toml and a potential for silent misconfiguration in the backend selection logic which could be improved for a better user experience. Overall, the changes are in the right direction to support the new backend.
| policy_num_training_steps = None | ||
| # set to a large number for megatron scheduler init | ||
| # lr will be managed externally via set_lr() | ||
| policy_num_training_steps = 1e9 |
There was a problem hiding this comment.
Hehe ok, a little odd but I guess why not :D
| cmd = [ | ||
| "uv", | ||
| "run", | ||
| "--isolated", |
There was a problem hiding this comment.
(just documenting it for later: this was to make sure people don't run out of disk, because the megatron dependencies are very big, and --isolated will make sure they are cleaned up)
There was a problem hiding this comment.
was running into this just for fsdp as well btw
i think it's more so a ray + uv shipping working dir thing
| "flashinfer-jit-cache==0.5.3", | ||
| ] | ||
|
|
||
| conflicts = [ |
There was a problem hiding this comment.
What does this do? It might be worth adding a comment above this section saying this is for megatron and describing briefly what each part does (conflicts, no-build-isolation-package, override-dependencies). Doesn't need to be long.
There was a problem hiding this comment.
uv sync automatically tries to resolve conflicts between extras unless they are explicitly defined as conflicts. Megatron's deps clash enough to have problems with jax. The override-dependencies are mostly because the megatron-bridge dependencies are not pinned to a specific version, and also install features we don't use.
I'll leave some comments
NovaSky-AI#1067) Change the training backends to the desired state of "jax", "fsdp", "megatron". Adds megatron deps to pyproject.toml via skyrl-train[mcore] dependency, and add needed overrides to get installation working. For FSDP + vLLM: ```bash uv run --isolated --extra tinker -m tx.tinker.api --base-model "Qwen/Qwen3-0.6B" --backend fsdp ``` For Megatron + vLLM: ```bash uv run --isolated --extra tinker -m tx.tinker.api --base-model "Qwen/Qwen3-0.6B" --backend megatron ``` <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1067" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end -->
Change the training backends to the desired state of "jax", "fsdp", "megatron".
Adds megatron deps to pyproject.toml via skyrl-train[mcore] dependency, and add needed overrides to get installation working.
For FSDP + vLLM:
uv run --isolated --extra tinker -m tx.tinker.api --base-model "Qwen/Qwen3-0.6B" --backend fsdpFor Megatron + vLLM:
uv run --isolated --extra tinker -m tx.tinker.api --base-model "Qwen/Qwen3-0.6B" --backend megatron