refactor(megatron_training_lib): add ordered fallback chains for log parsing [AIMVT-161] by atnair-amd · Pull Request #158 · ROCm/cvs

atnair-amd · 2026-05-07T00:16:33Z

Summary

Closes [AIMVT-161]. Refactors the inline regex parsing/progress/NaN checks in cvs/lib/megatron_training_lib.py into ordered fallback chains and pure module-level helpers. Each chain is seeded with [new, old] so newer Megatron output (throughput per GPU (TFLOP/s/GPU): N) is preferred but older output (throughput per GPU: N) still parses correctly. No config or doc changes — pattern lists are intentionally library-internal.

Why

The parsing regex was edited in place when newer Megatron builds changed the output format (throughput per GPU: to throughput per GPU (TFLOP/s/GPU):), losing the ability to parse logs from the older builds. A fallback chain is the structural fix: future format changes append a new pattern instead of overwriting an old one, and old logs keep parsing.

The progress regex was also buried inside poll_for_training_completion's time.sleep(80)-then-loop machinery, which made it un-unit-testable without monkeypatching sleeps + phdl.exec + scan_for_training_errors + get_training_results_dict. Extracting pure helpers makes both the parsing and progress/NaN checks testable with a single string fixture.

What changed

cvs/lib/megatron_training_lib.py:

Three new module-level constants — ordered pattern chains, [new, old]:

TRAINING_RESULT_PATTERNS = {
    'throughput_per_gpu': [
        r'throughput per GPU(?:\s*\([^)]*\))?\s*:\s+([0-9\.]+)',
        r'throughput per GPU:\s+([0-9\.]+)',
    ],
    'tokens_per_gpu': [r'tokens\/GPU\/s:\s+([0-9]+)'],
    'mem_usage': [r'mem usages:\s+([0-9\.]+)'],
    'elapsed_time_per_iteration': [r'elapsed time per iteration \(ms\):\s+([0-9\.]+)'],
}

TRAINING_PROGRESS_PATTERNS = [
    r'throughput per GPU(?:\s*\([^)]*\))?\s*:|tokens\/GPU\/s\s+[0-9]+',
    r'throughput per GPU:|tokens\/GPU\/s\s+[0-9]+',
]

TRAINING_NAN_PATTERNS = [...]   # same shape; behavior-preserving

Three new pure helpers — first non-empty match wins:

def _parse_training_results(output: str) -> dict: ...
def _is_training_complete(output: str) -> bool: ...
def _has_nan_inf_results(output: str) -> bool: ...

get_training_results_dict reduces to a thin wrapper: fetch the log via phdl.exec, then _parse_training_results(output).
poll_for_training_completion replaces its inline re.search(...) calls with _is_training_complete(output) and _has_nan_inf_results(output).
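
As a rough illustration of "first non-empty match wins", here is a minimal sketch of how a chain lookup could work. The pattern table is copied from the description above. The function name `parse_training_results_sketch`, the return shape (a dict of captured strings per metric), and the two sample log lines are assumptions for illustration, not the merged implementation.

```python
import re

# Pattern chains copied from the PR description; newest format first.
TRAINING_RESULT_PATTERNS = {
    'throughput_per_gpu': [
        r'throughput per GPU(?:\s*\([^)]*\))?\s*:\s+([0-9\.]+)',
        r'throughput per GPU:\s+([0-9\.]+)',
    ],
    'tokens_per_gpu': [r'tokens\/GPU\/s:\s+([0-9]+)'],
    'mem_usage': [r'mem usages:\s+([0-9\.]+)'],
    'elapsed_time_per_iteration': [r'elapsed time per iteration \(ms\):\s+([0-9\.]+)'],
}


def parse_training_results_sketch(output: str) -> dict:
    """Hypothetical helper: map each metric to the captures of the first matching pattern."""
    results = {}
    for metric, chain in TRAINING_RESULT_PATTERNS.items():
        for pattern in chain:
            matches = re.findall(pattern, output)
            if matches:  # first non-empty match wins; later patterns are skipped
                results[metric] = matches
                break
    return results


# Made-up single-line fixtures in the two formats named in the description.
new_log = "throughput per GPU (TFLOP/s/GPU): 612.5 | tokens/GPU/s: 9000"
old_log = "throughput per GPU: 480.2 | tokens/GPU/s: 7000"

new_results = parse_training_results_sketch(new_log)
old_results = parse_training_results_sketch(old_log)
print(new_results)
print(old_results)
```

In this sketch, metrics with no match are simply absent from the result; the real helper may handle that case differently.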

cvs/lib/unittests/test_megatron_training_lib.py — append two test classes (4 tests total):

TestTrainingLogParsing.test_parse_new_format — parser extracts every metric from new-format input.
TestTrainingLogParsing.test_parse_old_format_falls_back — parser extracts every metric from old-format input.
TestProgressDetection.test_handles_new_format and test_handles_old_format — exercise the separate TRAINING_PROGRESS_PATTERNS chain (different code path from parsing).
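
For readers who want to see what a single-string-fixture test looks like, the following is a hedged sketch rather than the tests added in this PR. The pattern list is copied from the description. `is_training_complete_sketch`, the test class name, and the fixture lines are hypothetical.

```python
import re
import unittest

# Copied from the PR description; newest format first.
TRAINING_PROGRESS_PATTERNS = [
    r'throughput per GPU(?:\s*\([^)]*\))?\s*:|tokens\/GPU\/s\s+[0-9]+',
    r'throughput per GPU:|tokens\/GPU\/s\s+[0-9]+',
]


def is_training_complete_sketch(output: str) -> bool:
    """Hypothetical helper: True if any pattern in the chain finds a progress marker."""
    return any(re.search(p, output) for p in TRAINING_PROGRESS_PATTERNS)


class ProgressDetectionSketch(unittest.TestCase):
    def test_new_and_old_formats(self):
        # Made-up log lines; no phdl, no sleep, no polling loop needed.
        cases = [
            ("new format", "iteration 10 | throughput per GPU (TFLOP/s/GPU): 612.5 |", True),
            ("old format", "iteration 10 | throughput per GPU: 480.2 |", True),
            ("no progress line", "building GPT model ...", False),
        ]
        for name, log_text, expected in cases:
            with self.subTest(name):
                self.assertEqual(is_training_complete_sketch(log_text), expected)


suite = unittest.TestLoader().loadTestsFromTestCase(ProgressDetectionSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```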

…ble [AIMVT-160]

Makes the in-container HCA-id verification regex inside `exec_nic_setup_scripts` configurable via a new optional `hca_id_pattern` key in the training config (default `bnxt_|rocep`). Replaces the previously hardcoded `bnxt_` literal so users with Mellanox/RoCE or other RDMA NICs can extend the match (e.g. `bnxt_|rocep|mlx5_`) without patching the lib.

Changes:
- `cvs/lib/megatron_training_lib.py`: add `hca_id_pattern` setdefault and thread `self.hca_id_pattern` into the `re.search` call after the libbnxt copy.
- `cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json`: add `_example_hca_id_pattern` and `hca_id_pattern` keys (distributed only; the libbnxt workaround never runs in single-node mode).
- `docs/reference/configuration-files/megatron.rst`: update the embedded distributed JSON dropdown and add two rows to the distributed `config` parameters table.
- `cvs/lib/unittests/test_megatron_training_lib.py`: add `TestExecNicSetupHcaIdPattern` with two cases (default+rocep, override+mlx5) that together catch every plausible regression: missing setdefault, missing f-string interpolation, wrong default value, and stale hardcoded literal.

Repro:
    .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
        -p "test_megatron_training_lib.py" -v
    # expect: 3 tests (1 from PR0 + 2 new), all "ok"

Integration: run `cvs run megatron_llama3_1_8b_distributed` with `hca_id_pattern: "bnxt_|rocep"` in `mi3xx_megatron_llama_distributed.json`; expect identical behavior to before.

…o config [AIMVT-162]

Moves the in-container Megatron-LM root and the per-tokenizer training-script paths out of `cvs/lib/megatron_training_lib.py` and into the training config (new `megatron_root` and `training_scripts` keys, both with defaults that match the previously hardcoded values). The if/elif tokenizer-to-script chain is replaced by a config-driven loop, so adding a new model family (e.g. `llama-4`) becomes a config edit, not a code edit. All `cd`, `self.training_script`, and the line-430 `sed` target now derive from these keys.

Side effect: fixes a latent bug. The line-430 `sed` previously rewrote `train_llama3.sh` even when `self.training_script` was `train_llama2.sh`. It now uses `self.training_script` consistently, so the right script gets patched for both model families.

Out of scope (queued):
- `extra_ld_library_paths` for the `/usr/local/lib/` prepended at line 331.
- libbnxt source/dest paths in `exec_nic_setup_scripts` (lines 304-307).
- Raise on no-match in the `training_scripts` loop instead of silent `None` (preserves existing if/elif fallthrough behavior; not a new bug).

Changes:
- `cvs/lib/megatron_training_lib.py`: add `megatron_root` and `training_scripts` setdefaults; replace the if/elif at lines 244-247 with a config-driven loop; thread `self.megatron_root` into `build_training_job_cmd` (3 sites); thread `self.training_script` into `start_training_job`'s sed target.
- `cvs/input/config_file/training/megatron/{mi3xx_megatron_llama_distributed,mi3xx_megatron_llama_single,mi35x_megatron_llama_single}.json`: add `megatron_root` and `training_scripts` keys to each.
- `docs/reference/configuration-files/megatron.rst`: update all three embedded JSON dropdowns and add two rows to each `config` parameters table (single, mi35x single, distributed).
- `cvs/lib/unittests/test_megatron_training_lib.py`: add `TestMegatronRootPropagation` with one test, three assertions (no leakage, exact-equals on training_script, cd-site substitution).

Repro:
    .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
        -p "test_megatron_training_lib.py" -v
    # expect: 4 tests passing (1 PR0 + 2 PR1 + 1 PR3)

Integration: run `cvs run megatron_llama3_1_8b_distributed` with default config; expect identical behavior to before. Llama-2 runs now patch `TRAIN_LOG=` on `train_llama2.sh` instead of (incorrectly) `train_llama3.sh`.
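
To make the "config-driven loop" idea concrete, here is a small sketch under stated assumptions. The commit message does not show the shape of `training_scripts`, so the dict layout (tokenizer substring mapped to script name), the key strings, and the function name `select_training_script` are all hypothetical. Only the script names `train_llama3.sh` / `train_llama2.sh` and the silent `None` on no-match come from the text above.

```python
from typing import Optional

# Hypothetical config shape: tokenizer substring -> training script name.
EXAMPLE_TRAINING_SCRIPTS = {
    'llama-3': 'train_llama3.sh',
    'llama-2': 'train_llama2.sh',
}


def select_training_script(tokenizer_model: str, training_scripts: dict) -> Optional[str]:
    """Return the script for the first key found in the tokenizer name, else None."""
    name = tokenizer_model.lower()
    for key, script in training_scripts.items():
        if key in name:
            return script
    # Mirrors the fallthrough described above: no match yields None, not an error.
    return None


llama3_script = select_training_script('meta-llama/Llama-3.1-8B', EXAMPLE_TRAINING_SCRIPTS)
llama2_script = select_training_script('meta-llama/Llama-2-7b-hf', EXAMPLE_TRAINING_SCRIPTS)
unknown_script = select_training_script('some-org/llama-4-model', EXAMPLE_TRAINING_SCRIPTS)
print(llama3_script, llama2_script, unknown_script)
```

With a layout like this, supporting a new family would mean adding one entry to the config dict rather than another `elif` branch.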

…parsing [AIMVT-161]

Refactors the inline regex parsing/progress/NaN checks in `cvs/lib/megatron_training_lib.py` into ordered fallback chains (`TRAINING_RESULT_PATTERNS`, `TRAINING_PROGRESS_PATTERNS`, `TRAINING_NAN_PATTERNS`) and pure module-level helpers (`_parse_training_results`, `_is_training_complete`, `_has_nan_inf_results`). Each chain is seeded with `[new, old]` so newer Megatron output (`throughput per GPU (TFLOP/s/GPU): N`) is preferred but older output (`throughput per GPU: N`) still parses correctly. First non-empty match wins; adding a third format becomes a one-line append.

The helpers are pure functions on a single string, so they're testable without `phdl`, `time.sleep(80)`, or any of `poll_for_training_completion`'s loop machinery; `get_training_results_dict` and `poll_for_training_completion` are now thin wrappers.

No config or doc changes: pattern lists are intentionally library-internal.

Out of scope (queued):
- Making the pattern lists configurable via the JSON config (revisit if a third format shows up).
- Fixing the long-standing `[NaN|Inf]` character-class bug noted in the existing code (separate ticket; behavior-preserving today).

Changes:
- `cvs/lib/megatron_training_lib.py`:
  - Add `TRAINING_RESULT_PATTERNS`, `TRAINING_PROGRESS_PATTERNS`, `TRAINING_NAN_PATTERNS` module-level constants.
  - Add `_parse_training_results`, `_is_training_complete`, `_has_nan_inf_results` pure helpers.
  - Refactor `get_training_results_dict` to call `_parse_training_results`.
  - Refactor `poll_for_training_completion` to call `_is_training_complete` and `_has_nan_inf_results` (replaces inline `re.search` calls).
- `cvs/lib/unittests/test_megatron_training_lib.py`:
  - Add `TestTrainingLogParsing` (2 cases: new-format + old-format-falls-back).
  - Add `TestProgressDetection` (2 cases: new-format + old-format).

Repro:
    .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
        -p "test_megatron_training_lib.py" -v
    # expect: 8 tests passing (1 PR0 + 2 PR1 + 1 PR3 + 4 PR2)

anujmittal-amd

Other than the comment below and the hca pattern comment, LGTM for merge.

…aN|Inf)` [AIMVT-161]

Per review on PR #158: `[NaN|Inf]` is a regex character class matching ONE char from {N,a,I,n,f,|}, not the strings `NaN` or `Inf`. Replace with the alternation group `(?:NaN|Inf)` to match the intended literals in all four `TRAINING_NAN_PATTERNS` entries.

In-practice impact: minimal today. Real NaN/Inf output starts with `N`/`I` (both in the buggy class) so true positives fired; numeric output starts with digits (none in the class) so no false positives on real numbers. The fix tightens the regex against false positives on non-numeric junk starting with `a`/`n`/`f`/`|`.

Previously deferred (Out-of-scope in the prior commit); pulled into this PR per reviewer request. Stale `# NOTE:` comment apologizing for the bug is removed.

Add `TestNanInfDetection` exercising `_has_nan_inf_results` on:
- `... (TFLOP/s/GPU): NaN`: new format true-positive
- `... : Inf`: old format true-positive (fallback chain)
- `... (TFLOP/s/GPU): 612.5`: no false positive on numeric output
- `... (TFLOP/s/GPU): aaa`: proof-of-detection. Under the prior `[NaN|Inf]` char class this returned True (matched single char `a`); under the fixed `(?:NaN|Inf)` it returns False.

Cases run as `subTest` rows on a single `test_has_nan_inf_results` method (matches the existing `cvs/unittests/test_main.py` precedent; `pytest.mark.parametrize` doesn't work on `unittest.TestCase` methods, and dropping `TestCase` would make the class invisible to `run_all_unittests.py`'s `unittest.TestLoader.discover()`).

Repro:
    python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
    # expect: 9 tests passing
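
The character-class pitfall is easy to reproduce in isolation. The sketch below is illustrative only: the actual `TRAINING_NAN_PATTERNS` entries are not shown on this page, so the `PREFIX`, `BUGGY`, and `FIXED` patterns are assumed shapes built from the throughput prefix in the PR description.

```python
import re

# Assumed prefix, borrowed from the throughput pattern in the PR description.
PREFIX = r'throughput per GPU(?:\s*\([^)]*\))?\s*:\s+'

BUGGY = PREFIX + r'[NaN|Inf]'    # character class: ONE char from N, a, |, I, n, f
FIXED = PREFIX + r'(?:NaN|Inf)'  # alternation: the literal strings NaN or Inf

samples = [
    'throughput per GPU (TFLOP/s/GPU): NaN',
    'throughput per GPU: Inf',
    'throughput per GPU (TFLOP/s/GPU): 612.5',
    'throughput per GPU (TFLOP/s/GPU): aaa',
]

# Collect (buggy_match, fixed_match) per sample line.
outcomes = {
    line: (bool(re.search(BUGGY, line)), bool(re.search(FIXED, line)))
    for line in samples
}
for line, (buggy, fixed) in outcomes.items():
    print(f'{line!r}: buggy={buggy} fixed={fixed}')
```

Only the `aaa` row differs between the two patterns, which is why that row serves as the proof-of-detection case.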

… of raw interpolation [AIMVT-160]

Per @anujmittal-amd review on PR #157: the previous shape interpolated the config-supplied `hca_id_pattern` raw into `re.search`, putting the burden of regex correctness on the user. A typo like `mlx5+` (intending wildcard) silently false-matched `mlx5550000`; misconfigured segments went undetected because the failure surface was the misleading `fail_test('Broadcom libbnxt rdma driver is not properly copied on node ...')`, which points at the copy step rather than the config.

Treat `hca_id_pattern` as a `|`-separated list of literal NIC-name prefixes. The lib splits on `|`, strips per-segment whitespace, escapes each segment with `re.escape`, and rejoins with `|` to build the verification regex. For the default value `bnxt_|rocep`:

    effective regex: `hca_id:\s+(bnxt_|rocep)` -- byte-identical to before.

Backward-compatible by design: the `hca_id_pattern` key name and the default value are unchanged from `844a1d3`. Existing configs from PR #157 need no migration. The contract change is invisible to users supplying plain prefixes; only previously-broken configs (regex specials inside a segment) now fail loudly instead of false-matching.

Changes:
- `cvs/lib/megatron_training_lib.py`: replace the inline raw-interpolation `re.search` with a built-then-used regex inside `exec_nic_setup_scripts` (single 8-line block; key name + setdefault + attribute name unchanged).
- `docs/reference/configuration-files/megatron.rst`: update the `hca_id_pattern` row description to spell out the contract -- "`|`-separated list of NIC-name prefixes ... each segment is treated as a literal prefix (regex special chars are escaped by the lib)".

No JSON config change (defaults are byte-identical), no rename, no backward-compat shim.

Out of scope (queued):
- Same `re.escape` treatment for `nic_type` (also a regex-string config key, same input-error class). Reviewer didn't flag it; separate ticket.
- Validating non-empty `hca_id_pattern` at __init__ with a helpful error.
- Rebasing PRs #158 / #159 onto this commit (mechanical follow-up).
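
The split / strip / escape / rejoin pipeline described in this commit can be sketched as follows. This is an approximation of the described behavior, not the 8-line block from the lib; the function name `build_hca_id_regex` and the sample `hca_id:` lines are assumptions.

```python
import re


def build_hca_id_regex(hca_id_pattern: str) -> str:
    """Hypothetical helper: treat each |-separated segment as a literal NIC-name prefix."""
    segments = [s.strip() for s in hca_id_pattern.split('|')]
    segments = [s for s in segments if s]  # drop empty segments
    return r'hca_id:\s+(' + '|'.join(re.escape(s) for s in segments) + ')'


default_regex = build_hca_id_regex('bnxt_|rocep')
spaced_regex = build_hca_id_regex('bnxt_| rocep | mlx5_')
escaped_regex = build_hca_id_regex('mlx5+')
raw_regex = r'hca_id:\s+(' + 'mlx5+' + ')'  # the prior raw-interpolation shape

devinfo_line = 'hca_id:\tmlx5550000'  # made-up ibv_devinfo-style line
print(default_regex)
print('raw matches:', bool(re.search(raw_regex, devinfo_line)))
print('escaped matches:', bool(re.search(escaped_regex, devinfo_line)))
```

With raw interpolation the `+` acts as a quantifier and the typo matches; once escaped, `mlx5+` can only match a literal plus sign, so the same line no longer matches.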

…ble [AIMVT-160] (#157)

* feat(megatron_training_lib): make HCA-id verification regex configurable [AIMVT-160]

* fix(megatron_training_lib): re.escape hca_id_pattern segments instead of raw interpolation [AIMVT-160]

* test(megatron_training_lib): cover hca_id_pattern whitespace + regex-escape [AIMVT-160]

Add two tests to `TestExecNicSetupHcaIdPattern` exercising the hardening from the prior commit. The existing two tests (default-rocep and override-mlx5) are unchanged -- the new behavior is byte-identical for their inputs.

- `test_whitespace_around_segments_is_stripped`: `'bnxt_| rocep | mlx5_'` with surrounding whitespace per segment must still match `rocep1s0f0`. Catches a refactor that drops the .strip() call from the parser.
- `test_special_regex_chars_in_segment_are_escaped`: PROOF-OF-DETECTION for the `re.escape` fix. Config `hca_id_pattern='mlx5+'` plus devinfo `hca_id:\tmlx5550000`:
  - With `re.escape`: `+` is treated as literal; emitted regex is `mlx5\+`; no match -> `fail_test` IS called.
  - Without `re.escape` (the prior raw-interpolation shape): `5+` becomes a quantifier matching `550000` -> `fail_test` is NOT called -> the libbnxt-copy verify silently passes on the wrong NIC.

Verified by temporarily reverting the parser to raw-interpolation: `AssertionError: Expected 'fail_test' to have been called once. Called 0 times.`

Class docstring extended to note the new contract (segments are literal prefixes, not regex fragments).

Repro:
    python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
    # expect: 5 tests passing (1 smoke + 4 in TestExecNicSetupHcaIdPattern)

* feat(megatron_training_lib): fail loudly on empty hca_id_pattern segments [AIMVT-160]

Closes the residual edge case from `bef8d29`'s safety pass. After the parse+escape pipeline, an empty `segments` list (from a config value of `""`, `"|||"`, `" "`, or any all-separator/whitespace input) was producing the degenerate regex `hca_id:\s+()` which silently matches every `ibv_devinfo` `hca_id:` line -- the libbnxt-copy verification would pass on the wrong NIC, exactly the "input user error" failure class the surrounding work set out to eliminate.

Add an inline guard between the segment parse and the regex build: if zero non-empty segments, abort with `fail_test` and the offending raw value in the message so the user can find it quickly.

Validation lives at-use, not at `__init__`. `hca_id_pattern` is only consumed on the distributed-training + Broadcom/Thor code path, so init-time validation would false-fail on single-node or Mellanox configs that legitimately ignore the key.

Changes:
- `cvs/lib/megatron_training_lib.py`: 5-line guard added inside `exec_nic_setup_scripts` between the `segments` list comprehension and the `hca_id_regex` build.

* test(megatron_training_lib): cover hca_id_pattern empty-input validation [AIMVT-160]

Add `test_empty_pattern_aborts_with_fail_test` to `TestExecNicSetupHcaIdPattern`: config `hca_id_pattern=''` must trigger the inline guard from the prior commit and call `fail_test`.

PROOF-OF-DETECTION verified by removing the guard from `exec_nic_setup_scripts`:

    AssertionError: Expected 'fail_test' to have been called once. Called 0 times.

Without the guard, the empty input parses to zero segments, the join emits `hca_id:\s+()`, the empty-capture regex matches every devinfo line, and the for-loop's `if not re.search(...)` is never True -- silently passing the libbnxt-copy verification on whatever NIC happens to be present.

Repro:
    python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
    # expect: 6 tests passing (1 smoke + 5 in TestExecNicSetupHcaIdPattern)
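
The empty-input edge case can be demonstrated with a short sketch. It is not the guard from the lib: the real code calls `fail_test`, which is replaced here by a plain `ValueError` so the example is self-contained, and the name `build_hca_id_regex_guarded` and the sample devinfo line are hypothetical.

```python
import re


def build_hca_id_regex_guarded(hca_id_pattern: str) -> str:
    """Hypothetical helper: build the verification regex, rejecting all-empty input."""
    segments = [s.strip() for s in hca_id_pattern.split('|') if s.strip()]
    if not segments:
        # Stand-in for the lib's fail_test call; include the raw value for debugging.
        raise ValueError(f'hca_id_pattern has no non-empty segments: {hca_id_pattern!r}')
    return r'hca_id:\s+(' + '|'.join(re.escape(s) for s in segments) + ')'


# What an unguarded join would emit for zero segments.
degenerate_regex = r'hca_id:\s+()'
any_nic_line = 'hca_id:\tmlx5_0'  # made-up devinfo line for an unrelated NIC
degenerate_matches = re.search(degenerate_regex, any_nic_line) is not None

rejected = []
for bad_value in ['', '|||', ' ']:
    try:
        build_hca_id_regex_guarded(bad_value)
    except ValueError:
        rejected.append(bad_value)

print('degenerate regex matches unrelated NIC:', degenerate_matches)
print('rejected inputs:', rejected)
```

The empty capture group matches the empty string, so the degenerate regex accepts any `hca_id:` line; rejecting the input before the regex is built avoids that silent pass.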

atnair-amd added 3 commits May 6, 2026 17:13

atnair-amd requested review from anujmittal-amd, cijohnson and urtiwari May 7, 2026 00:16

anujmittal-amd approved these changes May 12, 2026

Comment thread cvs/lib/megatron_training_lib.py Outdated

atnair-amd added 3 commits May 12, 2026 15:45

make fmt

34da1f6

Merge branch 'main' into atnair/aimvt-161-parsing-fallback

6175847

atnair-amd merged commit d842560 into main May 12, 2026
2 checks passed

atnair-amd merged 7 commits into main from atnair/aimvt-161-parsing-fallback