refactor(megatron_training_lib): add ordered fallback chains for log parsing [AIMVT-161]#158
Merged
Merged
Conversation
…ble [AIMVT-160]
Makes the in-container HCA-id verification regex inside
`exec_nic_setup_scripts` configurable via a new optional `hca_id_pattern`
key in the training config (default `bnxt_|rocep`). Replaces the
previously hardcoded `bnxt_` literal so users with Mellanox/RoCE or
other RDMA NICs can extend the match (e.g. `bnxt_|rocep|mlx5_`)
without patching the lib.
Changes:
- `cvs/lib/megatron_training_lib.py`: add `hca_id_pattern` setdefault and
thread `self.hca_id_pattern` into the `re.search` call after the
libbnxt copy.
- `cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json`:
add `_example_hca_id_pattern` and `hca_id_pattern` keys (distributed
only; the libbnxt workaround never runs in single-node mode).
- `docs/reference/configuration-files/megatron.rst`: update the embedded
distributed JSON dropdown and add two rows to the distributed `config`
parameters table.
- `cvs/lib/unittests/test_megatron_training_lib.py`: add
`TestExecNicSetupHcaIdPattern` with two cases (default+rocep, override+mlx5)
that together catch every plausible regression: missing setdefault,
missing f-string interpolation, wrong default value, and stale
hardcoded literal.
Repro:
.test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
-p "test_megatron_training_lib.py" -v
# expect: 3 tests (1 from PR0 + 2 new), all "ok"
Integration: run `cvs run megatron_llama3_1_8b_distributed` with
`hca_id_pattern: "bnxt_|rocep"` in `mi3xx_megatron_llama_distributed.json`;
expect identical behavior to before.
…o config [AIMVT-162]
Moves the in-container Megatron-LM root and the per-tokenizer
training-script paths out of `cvs/lib/megatron_training_lib.py` and
into the training config (new `megatron_root` and `training_scripts`
keys, both with defaults that match the previously hardcoded values).
The if/elif tokenizer-to-script chain is replaced by a config-driven
loop, so adding a new model family (e.g. `llama-4`) becomes a config
edit, not a code edit. All `cd`, `self.training_script`, and the
line-430 `sed` target now derive from these keys.
Side effect — fixes a latent bug: the line-430 `sed` previously rewrote
`train_llama3.sh` even when `self.training_script` was
`train_llama2.sh`. Now uses `self.training_script` consistently, so
the right script gets patched for both model families.
Out of scope (queued):
- `extra_ld_library_paths` for the `/usr/local/lib/` prepended at line 331.
- libbnxt source/dest paths in `exec_nic_setup_scripts` (lines 304-307).
- Raise on no-match in the `training_scripts` loop instead of silent `None`
(preserves existing if/elif fallthrough behavior — not a new bug).
Changes:
- `cvs/lib/megatron_training_lib.py`: add `megatron_root` and
`training_scripts` setdefaults; replace the if/elif at lines 244-247
with a config-driven loop; thread `self.megatron_root` into
`build_training_job_cmd` (3 sites); thread `self.training_script`
into `start_training_job`'s sed target.
- `cvs/input/config_file/training/megatron/{mi3xx_megatron_llama_distributed,mi3xx_megatron_llama_single,mi35x_megatron_llama_single}.json`:
add `megatron_root` and `training_scripts` keys to each.
- `docs/reference/configuration-files/megatron.rst`: update all three
embedded JSON dropdowns and add two rows to each `config` parameters
table (single, mi35x single, distributed).
- `cvs/lib/unittests/test_megatron_training_lib.py`: add
`TestMegatronRootPropagation` with one test, three assertions
(no leakage, exact-equals on training_script, cd-site substitution).
Repro:
.test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
-p "test_megatron_training_lib.py" -v
# expect: 4 tests passing (1 PR0 + 2 PR1 + 1 PR3)
Integration: run `cvs run megatron_llama3_1_8b_distributed` with default
config; expect identical behavior to before. Llama-2 runs now patch
`TRAIN_LOG=` on `train_llama2.sh` instead of (incorrectly)
`train_llama3.sh`.
…parsing [AIMVT-161]
Refactors the inline regex parsing/progress/NaN checks in
`cvs/lib/megatron_training_lib.py` into ordered fallback chains
(`TRAINING_RESULT_PATTERNS`, `TRAINING_PROGRESS_PATTERNS`,
`TRAINING_NAN_PATTERNS`) and pure module-level helpers
(`_parse_training_results`, `_is_training_complete`,
`_has_nan_inf_results`).
Each chain is seeded with `[new, old]` so newer Megatron output
(`throughput per GPU (TFLOP/s/GPU): N`) is preferred but older output
(`throughput per GPU: N`) still parses correctly. First non-empty match
wins; adding a third format becomes a one-line append.
The helpers are pure functions on a single string so they're testable
without `phdl`, `time.sleep(80)`, or any of `poll_for_training_completion`'s
loop machinery — `get_training_results_dict` and
`poll_for_training_completion` are now thin wrappers.
No config or doc changes — pattern lists are intentionally library-internal.
Out of scope (queued):
- Making the pattern lists configurable via the JSON config (revisit if
a third format shows up).
- Fixing the long-standing `[NaN|Inf]` character-class bug noted in
the existing code (separate ticket — behavior-preserving today).
Changes:
- `cvs/lib/megatron_training_lib.py`:
- Add `TRAINING_RESULT_PATTERNS`, `TRAINING_PROGRESS_PATTERNS`,
`TRAINING_NAN_PATTERNS` module-level constants.
- Add `_parse_training_results`, `_is_training_complete`,
`_has_nan_inf_results` pure helpers.
- Refactor `get_training_results_dict` to call `_parse_training_results`.
- Refactor `poll_for_training_completion` to call `_is_training_complete`
and `_has_nan_inf_results` (replaces inline `re.search` calls).
- `cvs/lib/unittests/test_megatron_training_lib.py`:
- Add `TestTrainingLogParsing` (2 cases: new-format + old-format-falls-back).
- Add `TestProgressDetection` (2 cases: new-format + old-format).
Repro:
.test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
-p "test_megatron_training_lib.py" -v
# expect: 8 tests passing (1 PR0 + 2 PR1 + 1 PR3 + 4 PR2)
anujmittal-amd
approved these changes
May 12, 2026
anujmittal-amd
left a comment
There was a problem hiding this comment.
Other than the comment below along with the hca pattern comment; LGTM for merge.
…aN|Inf)` [AIMVT-161] Per review on PR #158: `[NaN|Inf]` is a regex character class matching ONE char from {N,a,I,n,f,|}, not the strings `NaN` or `Inf`. Replace with the alternation group `(?:NaN|Inf)` to match the intended literals in all four `TRAINING_NAN_PATTERNS` entries. In-practice impact: minimal today — real NaN/Inf output starts with `N`/`I` (both in the buggy class) so true positives fired; numeric output starts with digits (none in the class) so no false positives on real numbers. The fix tightens the regex against false positives on non-numeric junk starting with `a`/`n`/`f`/`|`. Previously deferred (Out-of-scope in the prior commit); pulled into this PR per reviewer request. Stale `# NOTE:` comment apologizing for the bug is removed.
Add `TestNanInfDetection` exercising `_has_nan_inf_results` on: - `... (TFLOP/s/GPU): NaN` — new format true-positive - `... : Inf` — old format true-positive (fallback chain) - `... (TFLOP/s/GPU): 612.5` — no false positive on numeric output - `... (TFLOP/s/GPU): aaa` — proof-of-detection: under the prior `[NaN|Inf]` char class this returned True (matched single char `a`); under the fixed `(?:NaN|Inf)` it returns False. Cases run as `subTest` rows on a single `test_has_nan_inf_results` method (matches the existing `cvs/unittests/test_main.py` precedent; `pytest.mark.parametrize` doesn't work on `unittest.TestCase` methods, and dropping `TestCase` would make the class invisible to `run_all_unittests.py`'s `unittest.TestLoader.discover()`). Repro: python -m unittest cvs.lib.unittests.test_megatron_training_lib -v # expect: 9 tests passing
atnair-amd
added a commit
that referenced
this pull request
May 12, 2026
… of raw interpolation [AIMVT-160] Per @anujmittal-amd review on PR #157: the previous shape interpolated the config-supplied `hca_id_pattern` raw into `re.search`, putting the burden of regex correctness on the user. A typo like `mlx5+` (intending wildcard) silently false-matched `mlx5550000`; misconfigured segments went undetected because the failure surface was the misleading `fail_test('Broadcom libbnxt rdma driver is not properly copied on node ...')`, which points at the copy step rather than the config. Treat `hca_id_pattern` as a `|`-separated list of literal NIC-name prefixes. The lib splits on `|`, strips per-segment whitespace, escapes each segment with `re.escape`, and rejoins with `|` to build the verification regex. For the default value `bnxt_|rocep`: effective regex: `hca_id:\s+(bnxt_|rocep)` -- byte-identical to before. Backward-compatible by design: the `hca_id_pattern` key name and the default value are unchanged from `844a1d3`. Existing configs from PR #157 need no migration. The contract change is invisible to users supplying plain prefixes; only previously-broken configs (regex specials inside a segment) now fail loudly instead of false-matching. Changes: - `cvs/lib/megatron_training_lib.py`: replace the inline raw-interpolation `re.search` with a built-then-used regex inside `exec_nic_setup_scripts` (single 8-line block; key name + setdefault + attribute name unchanged). - `docs/reference/configuration-files/megatron.rst`: update the `hca_id_pattern` row description to spell out the contract -- "`|`-separated list of NIC-name prefixes ... each segment is treated as a literal prefix (regex special chars are escaped by the lib)". No JSON config change (defaults are byte-identical), no rename, no backward-compat shim. Out of scope (queued): - Same `re.escape` treatment for `nic_type` (also a regex-string config key, same input-error class). Reviewer didn't flag it; separate ticket. - Validating non-empty `hca_id_pattern` at __init__ with a helpful error. - Rebasing PRs #158 / #159 onto this commit (mechanical follow-up).
atnair-amd
added a commit
that referenced
this pull request
May 12, 2026
…ble [AIMVT-160] (#157) * feat(megatron_training_lib): make HCA-id verification regex configurable [AIMVT-160] Makes the in-container HCA-id verification regex inside `exec_nic_setup_scripts` configurable via a new optional `hca_id_pattern` key in the training config (default `bnxt_|rocep`). Replaces the previously hardcoded `bnxt_` literal so users with Mellanox/RoCE or other RDMA NICs can extend the match (e.g. `bnxt_|rocep|mlx5_`) without patching the lib. Changes: - `cvs/lib/megatron_training_lib.py`: add `hca_id_pattern` setdefault and thread `self.hca_id_pattern` into the `re.search` call after the libbnxt copy. - `cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json`: add `_example_hca_id_pattern` and `hca_id_pattern` keys (distributed only; the libbnxt workaround never runs in single-node mode). - `docs/reference/configuration-files/megatron.rst`: update the embedded distributed JSON dropdown and add two rows to the distributed `config` parameters table. - `cvs/lib/unittests/test_megatron_training_lib.py`: add `TestExecNicSetupHcaIdPattern` with two cases (default+rocep, override+mlx5) that together catch every plausible regression: missing setdefault, missing f-string interpolation, wrong default value, and stale hardcoded literal. Repro: .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \ -p "test_megatron_training_lib.py" -v # expect: 3 tests (1 from PR0 + 2 new), all "ok" Integration: run `cvs run megatron_llama3_1_8b_distributed` with `hca_id_pattern: "bnxt_|rocep"` in `mi3xx_megatron_llama_distributed.json`; expect identical behavior to before. * fix(megatron_training_lib): re.escape hca_id_pattern segments instead of raw interpolation [AIMVT-160] Per @anujmittal-amd review on PR #157: the previous shape interpolated the config-supplied `hca_id_pattern` raw into `re.search`, putting the burden of regex correctness on the user. A typo like `mlx5+` (intending wildcard) silently false-matched `mlx5550000`; misconfigured segments went undetected because the failure surface was the misleading `fail_test('Broadcom libbnxt rdma driver is not properly copied on node ...')`, which points at the copy step rather than the config. Treat `hca_id_pattern` as a `|`-separated list of literal NIC-name prefixes. The lib splits on `|`, strips per-segment whitespace, escapes each segment with `re.escape`, and rejoins with `|` to build the verification regex. For the default value `bnxt_|rocep`: effective regex: `hca_id:\s+(bnxt_|rocep)` -- byte-identical to before. Backward-compatible by design: the `hca_id_pattern` key name and the default value are unchanged from `844a1d3`. Existing configs from PR #157 need no migration. The contract change is invisible to users supplying plain prefixes; only previously-broken configs (regex specials inside a segment) now fail loudly instead of false-matching. Changes: - `cvs/lib/megatron_training_lib.py`: replace the inline raw-interpolation `re.search` with a built-then-used regex inside `exec_nic_setup_scripts` (single 8-line block; key name + setdefault + attribute name unchanged). - `docs/reference/configuration-files/megatron.rst`: update the `hca_id_pattern` row description to spell out the contract -- "`|`-separated list of NIC-name prefixes ... each segment is treated as a literal prefix (regex special chars are escaped by the lib)". No JSON config change (defaults are byte-identical), no rename, no backward-compat shim. Out of scope (queued): - Same `re.escape` treatment for `nic_type` (also a regex-string config key, same input-error class). Reviewer didn't flag it; separate ticket. - Validating non-empty `hca_id_pattern` at __init__ with a helpful error. - Rebasing PRs #158 / #159 onto this commit (mechanical follow-up). * test(megatron_training_lib): cover hca_id_pattern whitespace + regex-escape [AIMVT-160] Add two tests to `TestExecNicSetupHcaIdPattern` exercising the hardening from the prior commit. The existing two tests (default-rocep and override-mlx5) are unchanged -- the new behavior is byte-identical for their inputs. - `test_whitespace_around_segments_is_stripped`: `'bnxt_| rocep | mlx5_'` with surrounding whitespace per segment must still match `rocep1s0f0`. Catches a refactor that drops the .strip() call from the parser. - `test_special_regex_chars_in_segment_are_escaped`: PROOF-OF-DETECTION for the `re.escape` fix. Config `hca_id_pattern='mlx5+'` plus devinfo `hca_id:\tmlx5550000`: - With `re.escape`: `+` is treated as literal; emitted regex is `mlx5\+`; no match -> `fail_test` IS called. - Without `re.escape` (the prior raw-interpolation shape): `5+` becomes a quantifier matching `550000` -> `fail_test` is NOT called -> the libbnxt-copy verify silently passes on the wrong NIC. Verified by temporarily reverting the parser to raw-interpolation: `AssertionError: Expected 'fail_test' to have been called once. Called 0 times.` Class docstring extended to note the new contract (segments are literal prefixes, not regex fragments). Repro: python -m unittest cvs.lib.unittests.test_megatron_training_lib -v # expect: 5 tests passing (1 smoke + 4 in TestExecNicSetupHcaIdPattern) * feat(megatron_training_lib): fail loudly on empty hca_id_pattern segments [AIMVT-160] Closes the residual edge case from `bef8d29`'s safety pass. After the parse+escape pipeline, an empty `segments` list (from a config value of `""`, `"|||"`, `" "`, or any all-separator/whitespace input) was producing the degenerate regex `hca_id:\s+()` which silently matches every `ibv_devinfo` `hca_id:` line -- the libbnxt-copy verification would pass on the wrong NIC, exactly the "input user error" failure class the surrounding work set out to eliminate. Add an inline guard between the segment parse and the regex build: if zero non-empty segments, abort with `fail_test` and the offending raw value in the message so the user can find it quickly. Validation lives at-use, not at `__init__`. `hca_id_pattern` is only consumed on the distributed-training + Broadcom/Thor code path, so init-time validation would false-fail on single-node or Mellanox configs that legitimately ignore the key. Changes: - `cvs/lib/megatron_training_lib.py`: 5-line guard added inside `exec_nic_setup_scripts` between the `segments` list comprehension and the `hca_id_regex` build. * test(megatron_training_lib): cover hca_id_pattern empty-input validation [AIMVT-160] Add `test_empty_pattern_aborts_with_fail_test` to `TestExecNicSetupHcaIdPattern`: config `hca_id_pattern=''` must trigger the inline guard from the prior commit and call `fail_test`. PROOF-OF-DETECTION verified by removing the guard from `exec_nic_setup_scripts`: AssertionError: Expected 'fail_test' to have been called once. Called 0 times. Without the guard, the empty input parses to zero segments, the join emits `hca_id:\s+()`, the empty-capture regex matches every devinfo line, and the for-loop's `if not re.search(...)` is never True -- silently passing the libbnxt-copy verification on whatever NIC happens to be present. Repro: python -m unittest cvs.lib.unittests.test_megatron_training_lib -v # expect: 6 tests passing (1 smoke + 5 in TestExecNicSetupHcaIdPattern)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes [AIMVT-161]. Refactors the inline regex parsing/progress/NaN checks in
cvs/lib/megatron_training_lib.pyinto ordered fallback chains and pure module-level helpers. Each chain is seeded with[new, old]so newer Megatron output (throughput per GPU (TFLOP/s/GPU): N) is preferred but older output (throughput per GPU: N) still parses correctly. No config or doc changes — pattern lists are intentionally library-internal.Why
The parsing regex was edited in place when newer Megatron builds changed the output format (
throughput per GPU:tothroughput per GPU (TFLOP/s/GPU):), losing the ability to parse logs from the older builds. A fallback chain is the structural fix: future format changes append a new pattern instead of overwriting an old one, and old logs keep parsing.The progress regex was also buried inside
poll_for_training_completion'stime.sleep(80)-then-loop machinery, which made it un-unit-testable without monkeypatching sleeps +phdl.exec+scan_for_training_errors+get_training_results_dict. Extracting pure helpers makes both the parsing and progress/NaN checks testable with a single string fixture.What changed
cvs/lib/megatron_training_lib.py:Three new module-level constants — ordered pattern chains,
[new, old]:Three new pure helpers — first non-empty match wins:
get_training_results_dictreduces to a thin wrapper: fetch the log viaphdl.exec, then_parse_training_results(output).poll_for_training_completionreplaces its inlinere.search(...)calls with_is_training_complete(output)and_has_nan_inf_results(output).cvs/lib/unittests/test_megatron_training_lib.py— append two test classes (4 tests total):TestTrainingLogParsing.test_parse_new_format— parser extracts every metric from new-format input.TestTrainingLogParsing.test_parse_old_format_falls_back— parser extracts every metric from old-format input.TestProgressDetection.test_handles_new_formatandtest_handles_old_format— exercise the separateTRAINING_PROGRESS_PATTERNSchain (different code path from parsing).