
[tx][megatron] making megatron skyrl-train worker usable as TX backend#1067

Merged
erictang000 merged 7 commits into NovaSky-AI:main from erictang000:tx_megatron
Feb 11, 2026

Conversation

Collaborator

@erictang000 erictang000 commented Feb 11, 2026

Changes the training backend options to the desired set: "jax", "fsdp", and "megatron".

Adds megatron deps to pyproject.toml via the skyrl-train[mcore] dependency, and adds the overrides needed to get installation working.

For FSDP + vLLM:

```bash
uv run --isolated --extra tinker -m tx.tinker.api --base-model "Qwen/Qwen3-0.6B" --backend fsdp
```

For Megatron + vLLM:

```bash
uv run --isolated --extra tinker -m tx.tinker.api --base-model "Qwen/Qwen3-0.6B" --backend megatron
```
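As a rough sketch of how the `--backend` flag above might be parsed, here is a minimal argparse version. This is illustrative only: the actual `tx.tinker.api` argument parsing (and its default backend) may differ.

```python
import argparse

# Illustrative sketch of the --backend flag; the real tx.tinker.api
# CLI may be structured differently.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tx.tinker.api")
    parser.add_argument("--base-model", required=True)
    parser.add_argument(
        "--backend",
        choices=["jax", "fsdp", "megatron"],  # the three supported backends
        default="jax",  # assumed default, not confirmed by the PR
    )
    return parser

args = build_parser().parse_args(
    ["--base-model", "Qwen/Qwen3-0.6B", "--backend", "megatron"]
)
```

Using `choices=` means a mistyped backend name fails at argument parsing rather than deep inside worker setup.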



@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the training backends to support megatron in addition to jax and fsdp. The changes involve updating pyproject.toml with new dependencies and configurations for uv, and modifying the backend selection logic to accommodate megatron.

My review focuses on the clarity and robustness of the new configuration. I've identified a misleading comment in pyproject.toml and a potential for silent misconfiguration in the backend selection logic; both could be improved for a better user experience. Overall, the changes are in the right direction to support the new backend.
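To illustrate the "silent misconfiguration" concern in general terms (the worker names below are hypothetical, not the actual SkyRL classes):

```python
# A chained default silently maps typos to the fallback backend:
def pick_worker_silent(backend: str) -> str:
    if backend == "megatron":
        return "MegatronWorker"
    return "FSDPWorker"  # a typo like "megtron" lands here without any error

# A stricter variant rejects unknown values explicitly:
def pick_worker_strict(backend: str) -> str:
    workers = {"jax": "JaxWorker", "fsdp": "FSDPWorker", "megatron": "MegatronWorker"}
    try:
        return workers[backend]
    except KeyError:
        raise ValueError(f"unsupported backend: {backend!r}") from None
```

The strict form turns a misconfiguration into an immediate, descriptive failure instead of silently training on the wrong backend.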

```diff
-policy_num_training_steps = None
+# set to a large number for megatron scheduler init
+# lr will be managed externally via set_lr()
+policy_num_training_steps = 1e9
```
Collaborator


Hehe ok, a little odd but I guess why not :D
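A small numeric sketch of why the 1e9 trick works: with the scheduler horizon that large, the per-step decay is negligible, so the value pushed through set_lr() is what actually takes effect. (Linear decay is assumed here purely for illustration; Megatron's scheduler shapes differ.)

```python
# With total_steps = 1e9, the decay applied by the scheduler itself is
# vanishingly small over any realistic training run, so the lr set
# externally via set_lr() dominates.
def linear_decay_lr(step: int, base_lr: float, total_steps: float) -> float:
    return base_lr * (1.0 - step / total_steps)

lr_after_10k = linear_decay_lr(10_000, 1e-4, 1e9)
# drift from 1e-4 is 1e-4 * (10_000 / 1e9) = 1e-9, i.e. ~0.001%
```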

```python
cmd = [
    "uv",
    "run",
    "--isolated",
```
Collaborator


(Just documenting this for later: it was to make sure people don't run out of disk. The megatron dependencies are very large, and --isolated makes sure they are cleaned up.)

Collaborator Author


I was running into this just for FSDP as well, btw. I think it's more of a ray + uv working-directory shipping thing.

```toml
    "flashinfer-jit-cache==0.5.3",
]

conflicts = [
```
Collaborator


What does this do? It might be worth adding a comment above this section saying this is for megatron and describing briefly what each part does (conflicts, no-build-isolation-package, override-dependencies). Doesn't need to be long.

Collaborator Author


uv sync automatically tries to resolve dependencies across extras unless they are explicitly declared as conflicts, and Megatron's deps clash with jax enough to cause problems. The override-dependencies are mostly because the megatron-bridge dependencies are not pinned to specific versions, and they also install features we don't use.

I'll leave some comments
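For reference, a minimal `[tool.uv]` sketch of the three mechanisms discussed above. The extra names match ones mentioned in this PR, but the package names and pins are illustrative placeholders, not the actual pyproject entries:

```toml
[tool.uv]
# Declare extras that must never be resolved together, so `uv sync`
# stops trying to reconcile their dependency trees.
conflicts = [
    [
        { extra = "tinker" },
        { extra = "mcore" },
    ],
]
# Build these packages against the already-installed environment
# (e.g. torch) instead of an isolated build environment.
no-build-isolation-package = ["example-kernel-package"]
# Force a version when an upstream (e.g. megatron-bridge) leaves it unpinned.
override-dependencies = ["example-dep==1.2.3"]
```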


@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 9 additional findings in Devin Review.


@erictang000 erictang000 merged commit 9172197 into NovaSky-AI:main Feb 11, 2026
3 of 5 checks passed
@erictang000 erictang000 deleted the tx_megatron branch February 11, 2026 02:33
erictang000 added a commit to erictang000/SkyRL that referenced this pull request Feb 11, 2026
[tx][megatron] making megatron skyrl-train worker usable as TX backend (NovaSky-AI#1067)


@pcmoritz pcmoritz added the tx label Feb 11, 2026
