Skip to content

refactor(megatron_training_lib): add ordered fallback chains for log parsing [AIMVT-161]#158

Merged
atnair-amd merged 7 commits into
mainfrom
atnair/aimvt-161-parsing-fallback
May 12, 2026
Merged

refactor(megatron_training_lib): add ordered fallback chains for log parsing [AIMVT-161]#158
atnair-amd merged 7 commits into
mainfrom
atnair/aimvt-161-parsing-fallback

Conversation

@atnair-amd
Copy link
Copy Markdown
Contributor

@atnair-amd atnair-amd commented May 7, 2026

Summary

Closes [AIMVT-161]. Refactors the inline regex parsing/progress/NaN checks in cvs/lib/megatron_training_lib.py into ordered fallback chains and pure module-level helpers. Each chain is seeded with [new, old] so newer Megatron output (throughput per GPU (TFLOP/s/GPU): N) is preferred but older output (throughput per GPU: N) still parses correctly. No config or doc changes — pattern lists are intentionally library-internal.

Why

The parsing regex was edited in place when newer Megatron builds changed the output format (throughput per GPU: to throughput per GPU (TFLOP/s/GPU):), losing the ability to parse logs from the older builds. A fallback chain is the structural fix: future format changes append a new pattern instead of overwriting an old one, and old logs keep parsing.

The progress regex was also buried inside poll_for_training_completion's time.sleep(80)-then-loop machinery, which made it un-unit-testable without monkeypatching sleeps + phdl.exec + scan_for_training_errors + get_training_results_dict. Extracting pure helpers makes both the parsing and progress/NaN checks testable with a single string fixture.

What changed

cvs/lib/megatron_training_lib.py:

Three new module-level constants — ordered pattern chains, [new, old]:

TRAINING_RESULT_PATTERNS = {
    'throughput_per_gpu': [
        r'throughput per GPU(?:\s*\([^)]*\))?\s*:\s+([0-9\.]+)',
        r'throughput per GPU:\s+([0-9\.]+)',
    ],
    'tokens_per_gpu': [r'tokens\/GPU\/s:\s+([0-9]+)'],
    'mem_usage': [r'mem usages:\s+([0-9\.]+)'],
    'elapsed_time_per_iteration': [r'elapsed time per iteration \(ms\):\s+([0-9\.]+)'],
}

TRAINING_PROGRESS_PATTERNS = [
    r'throughput per GPU(?:\s*\([^)]*\))?\s*:|tokens\/GPU\/s\s+[0-9]+',
    r'throughput per GPU:|tokens\/GPU\/s\s+[0-9]+',
]

TRAINING_NAN_PATTERNS = [...]   # same shape; behavior-preserving

Three new pure helpers — first non-empty match wins:

def _parse_training_results(output: str) -> dict: ...
def _is_training_complete(output: str) -> bool: ...
def _has_nan_inf_results(output: str) -> bool: ...
  • get_training_results_dict reduces to a thin wrapper: fetch the log via phdl.exec, then _parse_training_results(output).
  • poll_for_training_completion replaces its inline re.search(...) calls with _is_training_complete(output) and _has_nan_inf_results(output).

cvs/lib/unittests/test_megatron_training_lib.py — append two test classes (4 tests total):

  • TestTrainingLogParsing.test_parse_new_format — parser extracts every metric from new-format input.
  • TestTrainingLogParsing.test_parse_old_format_falls_back — parser extracts every metric from old-format input.
  • TestProgressDetection.test_handles_new_format and test_handles_old_format — exercise the separate TRAINING_PROGRESS_PATTERNS chain (different code path from parsing).

atnair-amd added 3 commits May 6, 2026 17:13
…ble [AIMVT-160]

Makes the in-container HCA-id verification regex inside
`exec_nic_setup_scripts` configurable via a new optional `hca_id_pattern`
key in the training config (default `bnxt_|rocep`). Replaces the
previously hardcoded `bnxt_` literal so users with Mellanox/RoCE or
other RDMA NICs can extend the match (e.g. `bnxt_|rocep|mlx5_`)
without patching the lib.

Changes:
- `cvs/lib/megatron_training_lib.py`: add `hca_id_pattern` setdefault and
  thread `self.hca_id_pattern` into the `re.search` call after the
  libbnxt copy.
- `cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json`:
  add `_example_hca_id_pattern` and `hca_id_pattern` keys (distributed
  only; the libbnxt workaround never runs in single-node mode).
- `docs/reference/configuration-files/megatron.rst`: update the embedded
  distributed JSON dropdown and add two rows to the distributed `config`
  parameters table.
- `cvs/lib/unittests/test_megatron_training_lib.py`: add
  `TestExecNicSetupHcaIdPattern` with two cases (default+rocep, override+mlx5)
  that together catch every plausible regression: missing setdefault,
  missing f-string interpolation, wrong default value, and stale
  hardcoded literal.

Repro:
  .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
      -p "test_megatron_training_lib.py" -v
  # expect: 3 tests (1 from PR0 + 2 new), all "ok"

Integration: run `cvs run megatron_llama3_1_8b_distributed` with
`hca_id_pattern: "bnxt_|rocep"` in `mi3xx_megatron_llama_distributed.json`;
expect identical behavior to before.
…o config [AIMVT-162]

Moves the in-container Megatron-LM root and the per-tokenizer
training-script paths out of `cvs/lib/megatron_training_lib.py` and
into the training config (new `megatron_root` and `training_scripts`
keys, both with defaults that match the previously hardcoded values).
The if/elif tokenizer-to-script chain is replaced by a config-driven
loop, so adding a new model family (e.g. `llama-4`) becomes a config
edit, not a code edit. All `cd`, `self.training_script`, and the
line-430 `sed` target now derive from these keys.

Side effect — fixes a latent bug: the line-430 `sed` previously rewrote
`train_llama3.sh` even when `self.training_script` was
`train_llama2.sh`. Now uses `self.training_script` consistently, so
the right script gets patched for both model families.

Out of scope (queued):
- `extra_ld_library_paths` for the `/usr/local/lib/` prepended at line 331.
- libbnxt source/dest paths in `exec_nic_setup_scripts` (lines 304-307).
- Raise on no-match in the `training_scripts` loop instead of silent `None`
  (preserves existing if/elif fallthrough behavior — not a new bug).

Changes:
- `cvs/lib/megatron_training_lib.py`: add `megatron_root` and
  `training_scripts` setdefaults; replace the if/elif at lines 244-247
  with a config-driven loop; thread `self.megatron_root` into
  `build_training_job_cmd` (3 sites); thread `self.training_script`
  into `start_training_job`'s sed target.
- `cvs/input/config_file/training/megatron/{mi3xx_megatron_llama_distributed,mi3xx_megatron_llama_single,mi35x_megatron_llama_single}.json`:
  add `megatron_root` and `training_scripts` keys to each.
- `docs/reference/configuration-files/megatron.rst`: update all three
  embedded JSON dropdowns and add two rows to each `config` parameters
  table (single, mi35x single, distributed).
- `cvs/lib/unittests/test_megatron_training_lib.py`: add
  `TestMegatronRootPropagation` with one test, three assertions
  (no leakage, exact-equals on training_script, cd-site substitution).

Repro:
  .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
      -p "test_megatron_training_lib.py" -v
  # expect: 4 tests passing (1 PR0 + 2 PR1 + 1 PR3)

Integration: run `cvs run megatron_llama3_1_8b_distributed` with default
config; expect identical behavior to before. Llama-2 runs now patch
`TRAIN_LOG=` on `train_llama2.sh` instead of (incorrectly)
`train_llama3.sh`.
…parsing [AIMVT-161]

Refactors the inline regex parsing/progress/NaN checks in
`cvs/lib/megatron_training_lib.py` into ordered fallback chains
(`TRAINING_RESULT_PATTERNS`, `TRAINING_PROGRESS_PATTERNS`,
`TRAINING_NAN_PATTERNS`) and pure module-level helpers
(`_parse_training_results`, `_is_training_complete`,
`_has_nan_inf_results`).

Each chain is seeded with `[new, old]` so newer Megatron output
(`throughput per GPU (TFLOP/s/GPU): N`) is preferred but older output
(`throughput per GPU: N`) still parses correctly. First non-empty match
wins; adding a third format becomes a one-line append.

The helpers are pure functions on a single string so they're testable
without `phdl`, `time.sleep(80)`, or any of `poll_for_training_completion`'s
loop machinery — `get_training_results_dict` and
`poll_for_training_completion` are now thin wrappers.

No config or doc changes — pattern lists are intentionally library-internal.

Out of scope (queued):
- Making the pattern lists configurable via the JSON config (revisit if
  a third format shows up).
- Fixing the long-standing `[NaN|Inf]` character-class bug noted in
  the existing code (separate ticket — behavior-preserving today).

Changes:
- `cvs/lib/megatron_training_lib.py`:
  - Add `TRAINING_RESULT_PATTERNS`, `TRAINING_PROGRESS_PATTERNS`,
    `TRAINING_NAN_PATTERNS` module-level constants.
  - Add `_parse_training_results`, `_is_training_complete`,
    `_has_nan_inf_results` pure helpers.
  - Refactor `get_training_results_dict` to call `_parse_training_results`.
  - Refactor `poll_for_training_completion` to call `_is_training_complete`
    and `_has_nan_inf_results` (replaces inline `re.search` calls).
- `cvs/lib/unittests/test_megatron_training_lib.py`:
  - Add `TestTrainingLogParsing` (2 cases: new-format + old-format-falls-back).
  - Add `TestProgressDetection` (2 cases: new-format + old-format).

Repro:
  .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
      -p "test_megatron_training_lib.py" -v
  # expect: 8 tests passing (1 PR0 + 2 PR1 + 1 PR3 + 4 PR2)
Copy link
Copy Markdown

@anujmittal-amd anujmittal-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the comment below along with the hca pattern comment; LGTM for merge.

Comment thread cvs/lib/megatron_training_lib.py Outdated
…aN|Inf)` [AIMVT-161]

Per review on PR #158: `[NaN|Inf]` is a regex character class matching
ONE char from {N,a,I,n,f,|}, not the strings `NaN` or `Inf`. Replace
with the alternation group `(?:NaN|Inf)` to match the intended literals
in all four `TRAINING_NAN_PATTERNS` entries.

In-practice impact: minimal today — real NaN/Inf output starts with
`N`/`I` (both in the buggy class) so true positives fired; numeric
output starts with digits (none in the class) so no false positives
on real numbers. The fix tightens the regex against false positives
on non-numeric junk starting with `a`/`n`/`f`/`|`.

Previously deferred (Out-of-scope in the prior commit); pulled into
this PR per reviewer request. Stale `# NOTE:` comment apologizing for
the bug is removed.
Add `TestNanInfDetection` exercising `_has_nan_inf_results` on:
- `... (TFLOP/s/GPU): NaN`   — new format true-positive
- `... : Inf`                — old format true-positive (fallback chain)
- `... (TFLOP/s/GPU): 612.5` — no false positive on numeric output
- `... (TFLOP/s/GPU): aaa`   — proof-of-detection: under the prior
  `[NaN|Inf]` char class this returned True (matched single char `a`);
  under the fixed `(?:NaN|Inf)` it returns False.

Cases run as `subTest` rows on a single `test_has_nan_inf_results`
method (matches the existing `cvs/unittests/test_main.py` precedent;
`pytest.mark.parametrize` doesn't work on `unittest.TestCase` methods,
and dropping `TestCase` would make the class invisible to
`run_all_unittests.py`'s `unittest.TestLoader.discover()`).

Repro:
  python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
  # expect: 9 tests passing
atnair-amd added a commit that referenced this pull request May 12, 2026
… of raw interpolation [AIMVT-160]

Per @anujmittal-amd review on PR #157: the previous shape interpolated
the config-supplied `hca_id_pattern` raw into `re.search`, putting the
burden of regex correctness on the user. A typo like `mlx5+` (intending
wildcard) silently false-matched `mlx5550000`; misconfigured segments
went undetected because the failure surface was the misleading
`fail_test('Broadcom libbnxt rdma driver is not properly copied on
node ...')`, which points at the copy step rather than the config.

Treat `hca_id_pattern` as a `|`-separated list of literal NIC-name
prefixes. The lib splits on `|`, strips per-segment whitespace, escapes
each segment with `re.escape`, and rejoins with `|` to build the
verification regex. For the default value `bnxt_|rocep`:

  effective regex: `hca_id:\s+(bnxt_|rocep)`  -- byte-identical to before.

Backward-compatible by design: the `hca_id_pattern` key name and the
default value are unchanged from `844a1d3`. Existing configs from PR #157
need no migration. The contract change is invisible to users supplying
plain prefixes; only previously-broken configs (regex specials inside a
segment) now fail loudly instead of false-matching.

Changes:
- `cvs/lib/megatron_training_lib.py`: replace the inline raw-interpolation
  `re.search` with a built-then-used regex inside `exec_nic_setup_scripts`
  (single 8-line block; key name + setdefault + attribute name unchanged).
- `docs/reference/configuration-files/megatron.rst`: update the
  `hca_id_pattern` row description to spell out the contract --
  "`|`-separated list of NIC-name prefixes ... each segment is treated
  as a literal prefix (regex special chars are escaped by the lib)".

No JSON config change (defaults are byte-identical), no rename, no
backward-compat shim.

Out of scope (queued):
- Same `re.escape` treatment for `nic_type` (also a regex-string config
  key, same input-error class). Reviewer didn't flag it; separate ticket.
- Validating non-empty `hca_id_pattern` at __init__ with a helpful error.
- Rebasing PRs #158 / #159 onto this commit (mechanical follow-up).
atnair-amd added a commit that referenced this pull request May 12, 2026
…ble [AIMVT-160] (#157)

* feat(megatron_training_lib): make HCA-id verification regex configurable [AIMVT-160]

Makes the in-container HCA-id verification regex inside
`exec_nic_setup_scripts` configurable via a new optional `hca_id_pattern`
key in the training config (default `bnxt_|rocep`). Replaces the
previously hardcoded `bnxt_` literal so users with Mellanox/RoCE or
other RDMA NICs can extend the match (e.g. `bnxt_|rocep|mlx5_`)
without patching the lib.

Changes:
- `cvs/lib/megatron_training_lib.py`: add `hca_id_pattern` setdefault and
  thread `self.hca_id_pattern` into the `re.search` call after the
  libbnxt copy.
- `cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json`:
  add `_example_hca_id_pattern` and `hca_id_pattern` keys (distributed
  only; the libbnxt workaround never runs in single-node mode).
- `docs/reference/configuration-files/megatron.rst`: update the embedded
  distributed JSON dropdown and add two rows to the distributed `config`
  parameters table.
- `cvs/lib/unittests/test_megatron_training_lib.py`: add
  `TestExecNicSetupHcaIdPattern` with two cases (default+rocep, override+mlx5)
  that together catch every plausible regression: missing setdefault,
  missing f-string interpolation, wrong default value, and stale
  hardcoded literal.

Repro:
  .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
      -p "test_megatron_training_lib.py" -v
  # expect: 3 tests (1 from PR0 + 2 new), all "ok"

Integration: run `cvs run megatron_llama3_1_8b_distributed` with
`hca_id_pattern: "bnxt_|rocep"` in `mi3xx_megatron_llama_distributed.json`;
expect identical behavior to before.

* fix(megatron_training_lib): re.escape hca_id_pattern segments instead of raw interpolation [AIMVT-160]

Per @anujmittal-amd review on PR #157: the previous shape interpolated
the config-supplied `hca_id_pattern` raw into `re.search`, putting the
burden of regex correctness on the user. A typo like `mlx5+` (intending
wildcard) silently false-matched `mlx5550000`; misconfigured segments
went undetected because the failure surface was the misleading
`fail_test('Broadcom libbnxt rdma driver is not properly copied on
node ...')`, which points at the copy step rather than the config.

Treat `hca_id_pattern` as a `|`-separated list of literal NIC-name
prefixes. The lib splits on `|`, strips per-segment whitespace, escapes
each segment with `re.escape`, and rejoins with `|` to build the
verification regex. For the default value `bnxt_|rocep`:

  effective regex: `hca_id:\s+(bnxt_|rocep)`  -- byte-identical to before.

Backward-compatible by design: the `hca_id_pattern` key name and the
default value are unchanged from `844a1d3`. Existing configs from PR #157
need no migration. The contract change is invisible to users supplying
plain prefixes; only previously-broken configs (regex specials inside a
segment) now fail loudly instead of false-matching.

Changes:
- `cvs/lib/megatron_training_lib.py`: replace the inline raw-interpolation
  `re.search` with a built-then-used regex inside `exec_nic_setup_scripts`
  (single 8-line block; key name + setdefault + attribute name unchanged).
- `docs/reference/configuration-files/megatron.rst`: update the
  `hca_id_pattern` row description to spell out the contract --
  "`|`-separated list of NIC-name prefixes ... each segment is treated
  as a literal prefix (regex special chars are escaped by the lib)".

No JSON config change (defaults are byte-identical), no rename, no
backward-compat shim.

Out of scope (queued):
- Same `re.escape` treatment for `nic_type` (also a regex-string config
  key, same input-error class). Reviewer didn't flag it; separate ticket.
- Validating non-empty `hca_id_pattern` at __init__ with a helpful error.
- Rebasing PRs #158 / #159 onto this commit (mechanical follow-up).

* test(megatron_training_lib): cover hca_id_pattern whitespace + regex-escape [AIMVT-160]

Add two tests to `TestExecNicSetupHcaIdPattern` exercising the hardening
from the prior commit. The existing two tests (default-rocep and
override-mlx5) are unchanged -- the new behavior is byte-identical for
their inputs.

- `test_whitespace_around_segments_is_stripped`: `'bnxt_| rocep | mlx5_'`
  with surrounding whitespace per segment must still match `rocep1s0f0`.
  Catches a refactor that drops the .strip() call from the parser.

- `test_special_regex_chars_in_segment_are_escaped`: PROOF-OF-DETECTION
  for the `re.escape` fix. Config `hca_id_pattern='mlx5+'` plus devinfo
  `hca_id:\tmlx5550000`:
    - With `re.escape`: `+` is treated as literal; emitted regex is
      `mlx5\+`; no match -> `fail_test` IS called.
    - Without `re.escape` (the prior raw-interpolation shape): `5+`
      becomes a quantifier matching `550000` -> `fail_test` is NOT called
      -> the libbnxt-copy verify silently passes on the wrong NIC.
  Verified by temporarily reverting the parser to raw-interpolation:
  `AssertionError: Expected 'fail_test' to have been called once. Called 0 times.`

Class docstring extended to note the new contract (segments are literal
prefixes, not regex fragments).

Repro:
  python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
  # expect: 5 tests passing (1 smoke + 4 in TestExecNicSetupHcaIdPattern)

* feat(megatron_training_lib): fail loudly on empty hca_id_pattern segments [AIMVT-160]

Closes the residual edge case from `bef8d29`'s safety pass. After the
parse+escape pipeline, an empty `segments` list (from a config value of
`""`, `"|||"`, `"   "`, or any all-separator/whitespace input) was
producing the degenerate regex `hca_id:\s+()` which silently matches
every `ibv_devinfo` `hca_id:` line -- the libbnxt-copy verification
would pass on the wrong NIC, exactly the "input user error" failure
class the surrounding work set out to eliminate.

Add an inline guard between the segment parse and the regex build:
if zero non-empty segments, abort with `fail_test` and the offending
raw value in the message so the user can find it quickly.

Validation lives at-use, not at `__init__`. `hca_id_pattern` is only
consumed on the distributed-training + Broadcom/Thor code path, so
init-time validation would false-fail on single-node or Mellanox
configs that legitimately ignore the key.

Changes:
- `cvs/lib/megatron_training_lib.py`: 5-line guard added inside
  `exec_nic_setup_scripts` between the `segments` list comprehension
  and the `hca_id_regex` build.

* test(megatron_training_lib): cover hca_id_pattern empty-input validation [AIMVT-160]

Add `test_empty_pattern_aborts_with_fail_test` to
`TestExecNicSetupHcaIdPattern`: config `hca_id_pattern=''` must trigger
the inline guard from the prior commit and call `fail_test`.

PROOF-OF-DETECTION verified by removing the guard from
`exec_nic_setup_scripts`:
  AssertionError: Expected 'fail_test' to have been called once. Called 0 times.

Without the guard, the empty input parses to zero segments, the join
emits `hca_id:\s+()`, the empty-capture regex matches every devinfo
line, and the for-loop's `if not re.search(...)` is never True --
silently passing the libbnxt-copy verification on whatever NIC happens
to be present.

Repro:
  python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
  # expect: 6 tests passing (1 smoke + 5 in TestExecNicSetupHcaIdPattern)
@atnair-amd atnair-amd merged commit d842560 into main May 12, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants