Dev by haizhongzheng · Pull Request #7 · Infini-AI-Lab/astraflow

haizhongzheng · 2026-05-29T20:37:34Z

Merge dev into main.

Upgrade the inference/runtime stack to the latest sglang and the dependency versions it requires, validated end-to-end on the FSDP backend (qwen3-1.7b math example, 2x L40).

Version pins (pyproject.toml, docs, Docker):
- sglang 0.5.5.post1 -> 0.5.12.post1
- torch 2.8.0 -> 2.11.0; torch_memory_saver 0.0.9 -> 0.0.9.post1
- transformers 4.57.1 -> 5.6.1 (sglang pins ==5.6.0, which has a flash-attention s_aux=None crash for non-sink models; 5.6.1 is the upstream patch release. Forced via [tool.uv] override-dependencies, which requires uv >= 0.10; documented in installation.md)
- peft -> >=0.18.0 (required by transformers 5.x)
- CUDA base image 12.9.1 -> 13.0.0

sglang 0.5.12 API compatibility:
- remove LoRAAbortReleasePatch (the abort-path lora_registry.release() it added is now fixed upstream; keeping it would double-release)
- remove enable_ep_moe from SGLangConfig (field dropped from ServerArgs)
- kernel package rename sgl_kernel -> sglang_kernel in the installation validator

transformers 5.x / sglang 0.5.12 runtime fixes (surfaced by the run):
- rlvr workflow: apply_chat_template now returns a BatchEncoding; pass return_dict=False to get the flat list[int] the rollout path expects
- fsdp apply_fsdp2: model._no_split_modules is a set in transformers 5.x; coerce to list before indexing
- raas free-port range capped at 55535 so sglang's derived gRPC port (port + 10000) stays <= 65535

Scope: FSDP backend only. Megatron / VL paths are intentionally not covered here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
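Two of the runtime fixes above are easy to illustrate in isolation. The following is a minimal sketch, not the code from this PR: the helper names (`clamp_port_range`, `derived_grpc_port`, `as_indexable`) are hypothetical, and sorting the set is an added choice to get a deterministic order (the commit only says the set is coerced to a list).

```python
MAX_TCP_PORT = 65535
# Per the commit text, sglang derives its gRPC port as (port + 10000).
GRPC_PORT_OFFSET = 10000
# Highest base port whose derived gRPC port is still a valid TCP port.
MAX_BASE_PORT = MAX_TCP_PORT - GRPC_PORT_OFFSET  # 55535


def clamp_port_range(low: int, high: int) -> tuple[int, int]:
    """Cap the upper bound of a free-port search range at MAX_BASE_PORT."""
    capped_high = min(high, MAX_BASE_PORT)
    if low > capped_high:
        raise ValueError(f"empty port range: {low}..{capped_high}")
    return low, capped_high


def derived_grpc_port(port: int) -> int:
    """Port that sglang would derive for gRPC, per the commit text."""
    return port + GRPC_PORT_OFFSET


def as_indexable(no_split_modules) -> list[str]:
    """Turn a set (or None) of module names into a list that supports indexing.

    Sorting is an assumption of this sketch, used only to make the order stable.
    """
    return sorted(no_split_modules or [])


if __name__ == "__main__":
    low, high = clamp_port_range(30000, 65535)
    print(low, high, derived_grpc_port(high))
    print(as_indexable({"Qwen3DecoderLayer"})[0])
```

With the cap in place, any port drawn from the clamped range yields a derived gRPC port of at most 65535.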

chore: bump sglang 0.5.5.post1 -> 0.5.12.post1 (FSDP path)

sglang 0.5.12's /health round-trips through the scheduler, which stays saturated for ~30-40s during the initial unchunked prefill of ~2048 requests/engine. The old 3-strike / 30s watchdog (5s probe timeout) hard-exited a busy-but-alive engine before the first rollout batch completed, hanging the rollout pipeline at step 0.

Raise the /health probe timeout 5s -> 20s so a slow-but-alive endpoint isn't marked failed, and the failure budget 3 -> 5 strikes. A crashed engine refuses connections instantly, so real-death detection stays ~50s (worst case ~100s) while the prefill ramp is tolerated.

Verified: math and code qwen3-8b-m2po-delta recipes train through the ramp with zero watchdog strikes.
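The strike budget and the detection-time figures can be sketched as follows. This is an illustration under stated assumptions, not the watchdog from this PR: the class and function names are hypothetical, strikes are assumed to be consecutive (a healthy probe resets the count), and the 10s probe interval is inferred from the "3-strike / 30s" figure rather than stated in the commit.

```python
from typing import Callable


class HealthWatchdog:
    """Counts consecutive failed /health probes against a strike budget."""

    def __init__(self, probe: Callable[[float], bool],
                 timeout_s: float = 20.0, max_strikes: int = 5) -> None:
        self.probe = probe            # called with the timeout, returns True if healthy
        self.timeout_s = timeout_s
        self.max_strikes = max_strikes
        self.strikes = 0

    def tick(self) -> bool:
        """Run one probe; return True once the engine should be declared dead."""
        try:
            healthy = self.probe(self.timeout_s)
        except Exception:
            healthy = False           # refused connection or timeout counts as a strike
        self.strikes = 0 if healthy else self.strikes + 1
        return self.strikes >= self.max_strikes


def detection_bounds(strikes: int, interval_s: float, timeout_s: float) -> tuple[float, float]:
    """(fast, slow) seconds until the budget is exhausted.

    fast: every probe fails instantly (crashed engine refuses connections).
    slow: every probe runs into the full timeout.
    """
    return strikes * interval_s, strikes * max(interval_s, timeout_s)


if __name__ == "__main__":
    print(detection_bounds(3, 10.0, 5.0))   # old settings
    print(detection_bounds(5, 10.0, 20.0))  # new settings
```

Under the assumed 10s interval, the new settings give 50s for an instantly failing engine and 100s when every probe times out, which matches the figures quoted in the commit message.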

…ution

Two from-scratch install blockers with the sglang 0.5.12 / torch 2.11 stack:
- sglang 0.5.12 depends on flash-attn-4>=4.0.0b9 (a pre-release pulled in as a dependency), so resolution fails unless pre-releases are allowed. Add prerelease = "allow" to [tool.uv] so `uv pip install -e ".[sglang]"` resolves on both the conda and Docker paths.
- flash-attn 2.8.3 builds from source; nvcc writes GBs of intermediates to $TMPDIR. When $TMPDIR is a small/NFS-quota'd home the build fails with "nvFatbin error: empty input" / "Disk quota exceeded" from truncated temps. Document setting CUDA_HOME and a roomy TMPDIR, switch the sglang step to the project-extra form, and clarify flash-attn (FA2, trainer) vs flash-attn-4 (pulled in by sglang).
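A preflight check before the flash-attn source build could look roughly like this. It is a sketch and not part of this PR: the function name and the 20 GiB threshold are hypothetical, and `shutil.disk_usage` reports filesystem-level free space, so it will not necessarily reflect a per-user NFS quota.

```python
import os
import shutil
import tempfile
from typing import Optional


def tmpdir_has_room(min_free_gib: float = 20.0, path: Optional[str] = None) -> bool:
    """Return True if the build temp directory has at least min_free_gib free.

    The directory is `path` if given, else $TMPDIR, else the platform default.
    The 20 GiB default is an arbitrary placeholder, not a measured requirement.
    """
    target = path or os.environ.get("TMPDIR") or tempfile.gettempdir()
    free_bytes = shutil.disk_usage(target).free
    return free_bytes >= min_free_gib * 1024 ** 3


if __name__ == "__main__":
    if not os.environ.get("CUDA_HOME"):
        print("CUDA_HOME is not set; the flash-attn build may not find nvcc")
    if not tmpdir_has_room():
        print("TMPDIR looks small; point it at a roomier local disk before building")
```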

sglang requires an unbounded "kernels", so uv resolved the latest (0.15), but transformers 5.6.1 only supports kernels<0.13 — its hub_kernels module constructs LayerRepository() without a revision/version, which kernels 0.15 rejects, so `import sglang` crashes with "Either a revision or a version must be specified." Pin to the range transformers 5.6.1 expects (0.12.x). Verified on a from-scratch env: kernels resolves to 0.12.3 and the math recipe trains.
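The constraint described above can be checked in an existing environment with a few lines. This is a sketch under assumptions: the helper names are hypothetical, the range is taken as 0.12.x from the commit text (the exact specifier in pyproject.toml is not shown here), and the parser handles plain numeric versions only, not pre-release suffixes.

```python
from importlib import metadata


def parse_version(version: str) -> tuple[int, ...]:
    """Parse a plain dotted numeric version such as '0.12.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_supported_kernels_range(version: str) -> bool:
    """True if the version is 0.12.x, the range the commit says transformers 5.6.1 expects."""
    major_minor = parse_version(version)[:2]
    return (0, 12) <= major_minor < (0, 13)


def installed_kernels_ok() -> bool:
    """Check the installed 'kernels' distribution, if any."""
    try:
        return in_supported_kernels_range(metadata.version("kernels"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    for candidate in ("0.12.3", "0.15.0"):
        print(candidate, in_supported_kernels_range(candidate))
```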

WWWjiahui and others added 5 commits May 28, 2026 22:31

Merge pull request #5 from WWWjiahui/chore/bump-sglang-0.5.12

b1bf6de

chore: bump sglang 0.5.5.post1 -> 0.5.12.post1 (FSDP path)

haizhongzheng merged commit 6145e22 into main May 29, 2026
1 check failed

haizhongzheng merged 5 commits into main from dev

2 participants
