train: guard v0.9 HF job launch plans by AbdelStark · Pull Request #398 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-06T11:07:45Z

Parent

Parent #385
Refs #391

add the v0.9 short A10G correctness-aware training config for the cross-benchmark pack revision v0.9.0-rc1
add v0.9 runtime container files with the existing structured CODELEWM_JOB_EVENT lifecycle contract
extend execution HF launch plans with digest-pinned runtime image references and a fail-closed --require-runtime-image-digest option
fix the multi-source CLI mode of scripts/build-passfail-pack so that pack materialization for #391 (v0.9 train: guarded 2-seed HF Jobs run after data/eval preflight) can use the builder path from #387 (v0.9 data: build cross-benchmark pass/fail execution pack with stratified labels)
document the guarded #391 sequence: publish the pack, build and push the digest-pinned image, dry-run the two-seed launch plan, then launch and monitor via the HF CLI and status parser
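
To illustrate the fail-closed digest requirement described above, here is a minimal sketch. It is not the code in codelewm/training/execution_launch_plan.py; the helper name `pinned_image_ref`, its parameters, and the example registry path are hypothetical. It assumes a digest has the usual OCI form of `sha256:` followed by 64 lowercase hex characters.

```python
import re
from typing import Optional

# Assumption: digests look like "sha256:" plus 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def pinned_image_ref(image: str, digest: Optional[str], require_digest: bool) -> str:
    """Return an image reference, pinned by digest when one is supplied.

    Hypothetical helper for illustration; names are not taken from the repo.
    """
    if digest is None:
        if require_digest:
            # Fail closed: refuse to plan a launch against a mutable reference.
            raise ValueError("runtime image digest is required but was not provided")
        return image
    if _DIGEST_RE.fullmatch(digest) is None:
        # A malformed digest is rejected whether or not one is required.
        raise ValueError(f"malformed runtime image digest: {digest!r}")
    return f"{image}@{digest}"
```

With `require_digest=True`, a missing or malformed digest raises before any launch plan is produced, which is the property the `--require-runtime-image-digest` option is described as enforcing.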

This PR intentionally does not close #391. It lands the code/config/runbook precondition. After merge, #391 still requires:

full v0.9 pack materialization and publication as abdelstark/codelewm-execution-pack@v0.9.0-rc1
v0.9 runtime image build/push and real digest capture
dry-run launch plan with the real digest
the two HF Jobs launches, monitoring, artifact download, manifest verification, checkpoint inspection, and secret scans

uv run pytest tests/training/test_execution_launch_plan.py tests/training/test_execution_train_config.py tests/containers/test_runtime_image.py tests/docs/test_hf_ml_intern_training.py tests/docs/test_scaled_training_runbook.py tests/data/execution_pack/test_passfail_pack.py (97 passed, 2 skipped)
uv run python -m compileall -q codelewm/training/execution_launch_plan.py tests/training/test_execution_launch_plan.py tests/training/test_execution_train_config.py tests/containers/test_runtime_image.py tests/data/execution_pack/test_passfail_pack.py
uv run scripts/hf-launch-execution-run --config config/train/scaled/codelewm_execution_v0_9_short_a10g.yaml --git-sha dryrun --date 20260606 --runtime-image-digest sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa --require-runtime-image-digest --json
uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests scripts/build-passfail-pack scripts/hf-launch-execution-run
git diff --check
uv run pytest tests/ (964 passed, 8 skipped, 1 warning)

Full local v0.9 pack materialization from .artifacts/wsd labels was stopped after prolonged silent execution; the builder needs progress output before this is a good operator preflight.
A deliberately tiny --max-completion-rows 20 smoke run failed with the error "no pass/fail records were produced", so it is not publication evidence.
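
As a sketch of the kind of progress output the builder currently lacks, a long-running loop can be wrapped in a generator that reports at a fixed time interval. This is an assumption about one possible approach, not the builder's actual code; `with_progress` and its parameters are hypothetical names.

```python
import sys
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def with_progress(
    items: Iterable[T],
    label: str,
    every_seconds: float = 30.0,
    stream=sys.stderr,
) -> Iterator[T]:
    """Yield items unchanged while printing periodic progress lines.

    Hypothetical helper: reports a running count at most once per interval,
    plus a final summary line when the input is exhausted.
    """
    count = 0
    last_report = time.monotonic()
    for item in items:
        yield item
        count += 1
        now = time.monotonic()
        if now - last_report >= every_seconds:
            print(f"[{label}] processed {count} items", file=stream, flush=True)
            last_report = now
    print(f"[{label}] done: {count} items", file=stream, flush=True)
```

Writing to stderr with `flush=True` keeps the progress lines separate from any JSON or data written to stdout and makes them visible immediately in job logs.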

train: guard v0.9 hf job launch plans

cb1b20d

AbdelStark merged commit 29b4852 into main Jun 6, 2026
9 checks passed

AbdelStark deleted the issue-391-v0-9-guarded-hf-run branch June 6, 2026 11:10
