
training: summarize HF job event logs #381

Merged
AbdelStark merged 1 commit into main from issue-370-job-event-status on Jun 5, 2026

@AbdelStark
Owner

Summary

  • Add codelewm.training.job_events to parse live CODELEWM_JOB_EVENT log lines and persisted reports/job_progress.jsonl rows.
  • Add scripts/hf-job-event-status for compact one-shot or watched status summaries across HF job IDs, including latest progress, losses, ETA, CUDA memory, checkpoint state, and the no-collapse gate.
  • Document the status command in the HF operation and scaled-training runbooks, with docs tests enforcing the helper.
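For readers unfamiliar with the pattern, here is a minimal sketch of how marker-prefixed log lines could be turned into a latest-progress summary. It is not the code in codelewm.training.job_events: the line layout (marker followed by a JSON object), the `step` and `loss` field names, and the helper names `parse_event_line` and `latest_progress` are all assumptions made for illustration.

```python
import json
from typing import Any, Iterable, Optional

# Assumed layout: "<any prefix>CODELEWM_JOB_EVENT <json object>".
# The real format emitted by the training jobs may differ.
EVENT_MARKER = "CODELEWM_JOB_EVENT"


def parse_event_line(line: str) -> Optional[dict[str, Any]]:
    """Return the JSON object after the marker, or None if absent or invalid."""
    _, found, tail = line.partition(EVENT_MARKER)
    if not found:
        return None
    try:
        payload = json.loads(tail.strip())
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def latest_progress(lines: Iterable[str]) -> Optional[dict[str, Any]]:
    """Return the event with the highest "step" (hypothetical field name)."""
    latest = None
    for line in lines:
        event = parse_event_line(line)
        if event is None or not isinstance(event.get("step"), int):
            continue  # skip non-event lines and events without a step
        if latest is None or event["step"] >= latest["step"]:
            latest = event
    return latest
```

Under the same assumptions, rows from a JSONL file such as reports/job_progress.jsonl could be handled by the same loop by calling `json.loads` on each row instead of `parse_event_line`.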

Live check

The new helper was exercised against the replacement v0.8 A10G jobs from #370:

uv run scripts/hf-job-event-status 6a2278d2e6aa50b87b9eba56 6a227a6ce52fdd2a02ed9005

At the time of the check, both jobs were RUNNING, emitting progress, and had passing step-1000 collapse diagnostics.

Validation

  • uv run pytest tests/training/test_job_events.py tests/docs/test_scaled_training_runbook.py tests/docs/test_hf_ml_intern_training.py
  • uv run pytest tests/training/test_job_events.py tests/training/test_execution_runner.py tests/training/test_execution_train_config.py tests/docs/test_scaled_training_runbook.py tests/docs/test_hf_ml_intern_training.py
  • uv run pytest tests/training/test_job_events.py
  • uv run python -m compileall -q codelewm/training/job_events.py scripts/hf-job-event-status tests/training/test_job_events.py
  • git diff --check
  • uv run codelewm secret-scan codelewm/training/job_events.py scripts/hf-job-event-status tests/training/test_job_events.py docs/training/SCALED_TRAINING_RUNBOOK.md docs/operations/HF_ML_INTERN_TRAINING.md --json
  • direct token-pattern grep over the touched Python/script/test files found no matches

Part of #370; supports #364.

@AbdelStark AbdelStark merged commit 425e7ad into main Jun 5, 2026
9 checks passed
@AbdelStark AbdelStark deleted the issue-370-job-event-status branch June 5, 2026 09:00
