train: guard v0.9 HF job launch plans #398

Merged
AbdelStark merged 1 commit into main from issue-391-v0-9-guarded-hf-run on Jun 6, 2026
@AbdelStark
Owner

Parent

Parent #385
Refs #391

Summary

Live-operation note

This PR intentionally does not close #391. It lands the code/config/runbook precondition. After merge, #391 still requires:

  • full v0.9 pack materialization and publication as abdelstark/codelewm-execution-pack@v0.9.0-rc1
  • v0.9 runtime image build/push and real digest capture
  • dry-run launch plan with the real digest
  • the two HF Jobs launches, monitoring, artifact download, manifest verification, checkpoint inspection, and secret scans
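
The digest precondition in the checklist above could be guarded with a check along these lines. This is a minimal sketch, not the code that ships in codelewm/training/execution_launch_plan.py; the helper name `check_runtime_image_digest` and the single-character placeholder heuristic are assumptions made for illustration.

```python
import re

# OCI-style sha256 digest: "sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def check_runtime_image_digest(digest: str) -> str:
    """Return the digest unchanged if it looks like a real sha256 image digest.

    Hypothetical helper: rejects malformed values and single-character
    filler (for example a digest made only of "a"), which is acceptable
    for a dry run but should not reach a live launch.
    """
    if not _DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    hex_part = digest.split(":", 1)[1]
    if len(set(hex_part)) == 1:
        raise ValueError("digest looks like a dry-run placeholder")
    return digest
```

A guard like this would let the dry-run plan keep accepting filler digests while the live launch path refuses them.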

Validation

  • uv run pytest tests/training/test_execution_launch_plan.py tests/training/test_execution_train_config.py tests/containers/test_runtime_image.py tests/docs/test_hf_ml_intern_training.py tests/docs/test_scaled_training_runbook.py tests/data/execution_pack/test_passfail_pack.py (97 passed, 2 skipped)
  • uv run python -m compileall -q codelewm/training/execution_launch_plan.py tests/training/test_execution_launch_plan.py tests/training/test_execution_train_config.py tests/containers/test_runtime_image.py tests/data/execution_pack/test_passfail_pack.py
  • uv run scripts/hf-launch-execution-run --config config/train/scaled/codelewm_execution_v0_9_short_a10g.yaml --git-sha dryrun --date 20260606 --runtime-image-digest sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa --require-runtime-image-digest --json
  • uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests scripts/build-passfail-pack scripts/hf-launch-execution-run
  • git diff --check
  • uv run pytest tests/ (964 passed, 8 skipped, 1 warning)
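
To re-run a validation list like the one above in order and stop at the first failure, a small wrapper could look like the following. It is a sketch only: `run_checks` is a hypothetical helper, not a script in this repository, and it assumes each command can be split with POSIX shell quoting rules.

```python
import shlex
import subprocess
import sys
from typing import Sequence


def run_checks(commands: Sequence[str]) -> int:
    """Run each command in order.

    Returns the index of the first command that exits non-zero,
    or -1 if every command succeeds.
    """
    for index, command in enumerate(commands):
        # Echo the command to stderr so the operator sees what is running.
        print(f"$ {command}", file=sys.stderr, flush=True)
        result = subprocess.run(shlex.split(command))
        if result.returncode != 0:
            return index
    return -1
```

For example, `run_checks(["git diff --check", "uv run pytest tests/"])` would run the two commands from the list in sequence and report which one failed, if any.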

Local preflight attempts

  • Full local v0.9 pack materialization from .artifacts/wsd labels was stopped after prolonged silent execution; the builder needs progress output before this is a good operator preflight.
  • A deliberately tiny smoke run with --max-completion-rows 20 failed with the error "no pass/fail records were produced", so it is not publication evidence.
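
One way to give the builder the progress output mentioned above is to wrap its row iteration in a small logging generator. The sketch below is an assumption about how that could look; `with_progress` and its parameters are hypothetical and not part of scripts/build-passfail-pack.

```python
import sys
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def with_progress(rows: Iterable[T], every: int = 1000, label: str = "rows") -> Iterator[T]:
    """Yield rows unchanged, logging a progress line to stderr every `every` items."""
    start = time.monotonic()
    count = 0
    for count, row in enumerate(rows, start=1):
        if count % every == 0:
            elapsed = time.monotonic() - start
            print(f"[{elapsed:7.1f}s] processed {count} {label}", file=sys.stderr, flush=True)
        yield row
    # Final summary, printed even when the input was empty (count stays 0).
    total = time.monotonic() - start
    print(f"done: {count} {label} in {total:.1f}s", file=sys.stderr, flush=True)
```

Because the wrapper writes to stderr and flushes on every message, progress stays visible even when stdout is redirected to a file or reserved for JSON output.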

@AbdelStark AbdelStark merged commit 29b4852 into main Jun 6, 2026
9 checks passed
@AbdelStark AbdelStark deleted the issue-391-v0-9-guarded-hf-run branch June 6, 2026 11:10
Development

Successfully merging this pull request may close these issues.

v0.9 train: guarded 2-seed HF Jobs run after data/eval preflight