Skip to content

feat(megatron_training_lib): make HCA-id verification regex configurable [AIMVT-160]#157

Merged
atnair-amd merged 5 commits into
mainfrom
atnair/aimvt-160-hca-id-pattern
May 12, 2026
Merged

feat(megatron_training_lib): make HCA-id verification regex configurable [AIMVT-160]#157
atnair-amd merged 5 commits into
mainfrom
atnair/aimvt-160-hca-id-pattern

Conversation

@atnair-amd
Copy link
Copy Markdown
Contributor

Summary

Closes [AIMVT-160]. Makes the in-container HCA-id verification regex inside exec_nic_setup_scripts configurable via a new optional hca_id_pattern key in the training config (default bnxt_|rocep). Replaces the previously hardcoded bnxt_ literal so users with Mellanox/RoCE or other RDMA NICs can extend the match without patching the lib.

Why

exec_nic_setup_scripts runs a Broadcom-specific libbnxt copy inside each container, then verifies the copy by re.search(r'hca_id:\s+bnxt_', ibv_devinfo_output). On clusters running different nics, that regex never matches. Today the only fix is editing the lib; this PR moves the pattern into config.

What changed

  • cvs/lib/megatron_training_lib.py

    • Add tdict.setdefault('hca_id_pattern', 'bnxt_|rocep') next to the existing nic_type setdefault.

    • Store on self.hca_id_pattern and thread it into the re.search call:

      if not re.search(rf'hca_id:\s+({self.hca_id_pattern})', out_dict[node], re.I):
  • cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json — add _example_hca_id_pattern: "bnxt_|rocep|mlx5_" and hca_id_pattern: "bnxt_|rocep" after the nccl_ib_gid_index block. Distributed only — the libbnxt workaround never runs in single-node mode, so the single-node JSONs intentionally don't get the key.

  • docs/reference/configuration-files/megatron.rst — distributed section only:

    • Update the embedded JSON dropdown to match the JSON above.
    • Add two rows to the distributed config parameters table.
  • cvs/lib/unittests/test_megatron_training_lib.py — append TestExecNicSetupHcaIdPattern with two cases:

    • test_default_pattern_matches_rocep — default config + hca_id:\trocep1s0f0, asserts fail_test is not called. Catches a regression where the default value loses rocep.
    • test_override_pattern_matches_mlx5hca_id_pattern='mlx5_' + hca_id:\tmlx5_0, asserts fail_test is not called. Catches missing setdefault, missing f-string interpolation, and a stale hardcoded literal.

…ble [AIMVT-160]

Makes the in-container HCA-id verification regex inside
`exec_nic_setup_scripts` configurable via a new optional `hca_id_pattern`
key in the training config (default `bnxt_|rocep`). Replaces the
previously hardcoded `bnxt_` literal so users with Mellanox/RoCE or
other RDMA NICs can extend the match (e.g. `bnxt_|rocep|mlx5_`)
without patching the lib.

Changes:
- `cvs/lib/megatron_training_lib.py`: add `hca_id_pattern` setdefault and
  thread `self.hca_id_pattern` into the `re.search` call after the
  libbnxt copy.
- `cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json`:
  add `_example_hca_id_pattern` and `hca_id_pattern` keys (distributed
  only; the libbnxt workaround never runs in single-node mode).
- `docs/reference/configuration-files/megatron.rst`: update the embedded
  distributed JSON dropdown and add two rows to the distributed `config`
  parameters table.
- `cvs/lib/unittests/test_megatron_training_lib.py`: add
  `TestExecNicSetupHcaIdPattern` with two cases (default+rocep, override+mlx5)
  that together catch every plausible regression: missing setdefault,
  missing f-string interpolation, wrong default value, and stale
  hardcoded literal.

Repro:
  .test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
      -p "test_megatron_training_lib.py" -v
  # expect: 3 tests (1 from PR0 + 2 new), all "ok"

Integration: run `cvs run megatron_llama3_1_8b_distributed` with
`hca_id_pattern: "bnxt_|rocep"` in `mi3xx_megatron_llama_distributed.json`;
expect identical behavior to before.
@anujmittal-amd
Copy link
Copy Markdown

The changes are good but want to address one things wrt regex:
Can we create the hca_id_pattern based on the input from the config.json file; instead of directly inserting the pattern in the regex? This way we can avoid input user error.

… of raw interpolation [AIMVT-160]

Per @anujmittal-amd review on PR #157: the previous shape interpolated
the config-supplied `hca_id_pattern` raw into `re.search`, putting the
burden of regex correctness on the user. A typo like `mlx5+` (intending
wildcard) silently false-matched `mlx5550000`; misconfigured segments
went undetected because the failure surface was the misleading
`fail_test('Broadcom libbnxt rdma driver is not properly copied on
node ...')`, which points at the copy step rather than the config.

Treat `hca_id_pattern` as a `|`-separated list of literal NIC-name
prefixes. The lib splits on `|`, strips per-segment whitespace, escapes
each segment with `re.escape`, and rejoins with `|` to build the
verification regex. For the default value `bnxt_|rocep`:

  effective regex: `hca_id:\s+(bnxt_|rocep)`  -- byte-identical to before.

Backward-compatible by design: the `hca_id_pattern` key name and the
default value are unchanged from `844a1d3`. Existing configs from PR #157
need no migration. The contract change is invisible to users supplying
plain prefixes; only previously-broken configs (regex specials inside a
segment) now fail loudly instead of false-matching.

Changes:
- `cvs/lib/megatron_training_lib.py`: replace the inline raw-interpolation
  `re.search` with a built-then-used regex inside `exec_nic_setup_scripts`
  (single 8-line block; key name + setdefault + attribute name unchanged).
- `docs/reference/configuration-files/megatron.rst`: update the
  `hca_id_pattern` row description to spell out the contract --
  "`|`-separated list of NIC-name prefixes ... each segment is treated
  as a literal prefix (regex special chars are escaped by the lib)".

No JSON config change (defaults are byte-identical), no rename, no
backward-compat shim.

Out of scope (queued):
- Same `re.escape` treatment for `nic_type` (also a regex-string config
  key, same input-error class). Reviewer didn't flag it; separate ticket.
- Validating non-empty `hca_id_pattern` at __init__ with a helpful error.
- Rebasing PRs #158 / #159 onto this commit (mechanical follow-up).
…escape [AIMVT-160]

Add two tests to `TestExecNicSetupHcaIdPattern` exercising the hardening
from the prior commit. The existing two tests (default-rocep and
override-mlx5) are unchanged -- the new behavior is byte-identical for
their inputs.

- `test_whitespace_around_segments_is_stripped`: `'bnxt_| rocep | mlx5_'`
  with surrounding whitespace per segment must still match `rocep1s0f0`.
  Catches a refactor that drops the .strip() call from the parser.

- `test_special_regex_chars_in_segment_are_escaped`: PROOF-OF-DETECTION
  for the `re.escape` fix. Config `hca_id_pattern='mlx5+'` plus devinfo
  `hca_id:\tmlx5550000`:
    - With `re.escape`: `+` is treated as literal; emitted regex is
      `mlx5\+`; no match -> `fail_test` IS called.
    - Without `re.escape` (the prior raw-interpolation shape): `5+`
      becomes a quantifier matching `550000` -> `fail_test` is NOT called
      -> the libbnxt-copy verify silently passes on the wrong NIC.
  Verified by temporarily reverting the parser to raw-interpolation:
  `AssertionError: Expected 'fail_test' to have been called once. Called 0 times.`

Class docstring extended to note the new contract (segments are literal
prefixes, not regex fragments).

Repro:
  python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
  # expect: 5 tests passing (1 smoke + 4 in TestExecNicSetupHcaIdPattern)
…ents [AIMVT-160]

Closes the residual edge case from `bef8d29`'s safety pass. After the
parse+escape pipeline, an empty `segments` list (from a config value of
`""`, `"|||"`, `"   "`, or any all-separator/whitespace input) was
producing the degenerate regex `hca_id:\s+()` which silently matches
every `ibv_devinfo` `hca_id:` line -- the libbnxt-copy verification
would pass on the wrong NIC, exactly the "input user error" failure
class the surrounding work set out to eliminate.

Add an inline guard between the segment parse and the regex build:
if zero non-empty segments, abort with `fail_test` and the offending
raw value in the message so the user can find it quickly.

Validation lives at-use, not at `__init__`. `hca_id_pattern` is only
consumed on the distributed-training + Broadcom/Thor code path, so
init-time validation would false-fail on single-node or Mellanox
configs that legitimately ignore the key.

Changes:
- `cvs/lib/megatron_training_lib.py`: 5-line guard added inside
  `exec_nic_setup_scripts` between the `segments` list comprehension
  and the `hca_id_regex` build.
…ion [AIMVT-160]

Add `test_empty_pattern_aborts_with_fail_test` to
`TestExecNicSetupHcaIdPattern`: config `hca_id_pattern=''` must trigger
the inline guard from the prior commit and call `fail_test`.

PROOF-OF-DETECTION verified by removing the guard from
`exec_nic_setup_scripts`:
  AssertionError: Expected 'fail_test' to have been called once. Called 0 times.

Without the guard, the empty input parses to zero segments, the join
emits `hca_id:\s+()`, the empty-capture regex matches every devinfo
line, and the for-loop's `if not re.search(...)` is never True --
silently passing the libbnxt-copy verification on whatever NIC happens
to be present.

Repro:
  python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
  # expect: 6 tests passing (1 smoke + 5 in TestExecNicSetupHcaIdPattern)
Copy link
Copy Markdown

@anujmittal-amd anujmittal-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@atnair-amd atnair-amd merged commit b5129b7 into main May 12, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants