Shm and version fixes #1

Merged
haizhongzheng merged 2 commits into main from shm-and-version-fixes
May 19, 2026

Conversation

@haizhongzheng
Member

Description

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

astraflow/train_worker/version.py hardcoded 0.5.1 — a leftover from the
upstream training engine it was derived from — while astraflow/version.py
reports 0.1.0. StatsLogger reads version_info from the train_worker copy,
so training runs recorded the wrong version in their W&B run config.
Set it to 0.1.0 so both version modules agree.
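A check along these lines could catch the two version modules drifting apart again. This is a minimal sketch, not part of the PR: it assumes each version module contains a quoted X.Y.Z literal, and the helper names `extract_version` and `find_mismatches` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: each version module holds one quoted X.Y.Z literal.
VERSION_RE = re.compile(r"""["'](\d+\.\d+\.\d+)["']""")


def extract_version(source: str) -> str:
    """Return the first quoted X.Y.Z literal found in the module source."""
    match = VERSION_RE.search(source)
    if match is None:
        raise ValueError("no X.Y.Z version literal found")
    return match.group(1)


def find_mismatches(sources: dict[str, str]) -> dict[str, str]:
    """Return {name: version} for every module if they disagree, else {}."""
    versions = {name: extract_version(text) for name, text in sources.items()}
    return versions if len(set(versions.values())) > 1 else {}


if __name__ == "__main__":
    # Paths taken from the PR description; files that are missing are skipped.
    paths = ["astraflow/version.py", "astraflow/train_worker/version.py"]
    sources = {p: Path(p).read_text() for p in paths if Path(p).exists()}
    print(find_mismatches(sources) or "no mismatch found")
```

Run from the repository root, this prints the conflicting versions if the two modules disagree.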

The documented --shm-size=16g is too small: a recipe run co-locates the
trainer, RaaS, and SGLang in one container sharing a single /dev/shm, and
RaaS stages received weights under /dev/shm/astraflow_weights. With 16g
(or the 64 MB default) weight transfer fails with
"OSError: [Errno 28] No space left on device" on 8B-scale recipes.
Raise the example to 512g and add a note on sizing /dev/shm.
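A preflight check of free space in /dev/shm before a run could surface the problem earlier than the Errno 28 failure. This is a rough sketch, not code from the repository: `parse_size` and `shm_shortfall` are hypothetical helpers, and the required size is whatever the recipe's weights need, which the PR does not state.

```python
import os
import shutil
import sys

# Binary multipliers for size suffixes such as "16g" or "64m".
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size string like '16g' or '64m' to bytes."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def shm_shortfall(required: str, free_bytes: int) -> int:
    """Return how many bytes are missing, or 0 if there is enough room."""
    return max(0, parse_size(required) - free_bytes)


if __name__ == "__main__":
    # Assumption: the caller passes the space the run needs; "16g" is only
    # the previously documented value, used here as a fallback example.
    required = sys.argv[1] if len(sys.argv) > 1 else "16g"
    if os.path.isdir("/dev/shm"):
        missing = shm_shortfall(required, shutil.disk_usage("/dev/shm").free)
        if missing:
            print(f"/dev/shm is short by {missing} bytes for {required}")
        else:
            print(f"/dev/shm has room for {required}")
```

Because the trainer, RaaS, and SGLang share one /dev/shm, the free space reported here is what remains after the other processes have taken their share, so the check is most useful right before weight transfer starts.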
@haizhongzheng haizhongzheng merged commit 7e09fe5 into main May 19, 2026
1 check failed
@haizhongzheng haizhongzheng deleted the shm-and-version-fixes branch May 19, 2026 15:13