feat(megatron_training_lib): make HCA-id verification regex configurable [AIMVT-160]#157
Merged
Merged
Conversation
…ble [AIMVT-160]
Makes the in-container HCA-id verification regex inside
`exec_nic_setup_scripts` configurable via a new optional `hca_id_pattern`
key in the training config (default `bnxt_|rocep`). Replaces the
previously hardcoded `bnxt_` literal so users with Mellanox/RoCE or
other RDMA NICs can extend the match (e.g. `bnxt_|rocep|mlx5_`)
without patching the lib.
Changes:
- `cvs/lib/megatron_training_lib.py`: add `hca_id_pattern` setdefault and
thread `self.hca_id_pattern` into the `re.search` call after the
libbnxt copy.
- `cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json`:
add `_example_hca_id_pattern` and `hca_id_pattern` keys (distributed
only; the libbnxt workaround never runs in single-node mode).
- `docs/reference/configuration-files/megatron.rst`: update the embedded
distributed JSON dropdown and add two rows to the distributed `config`
parameters table.
- `cvs/lib/unittests/test_megatron_training_lib.py`: add
`TestExecNicSetupHcaIdPattern` with two cases (default+rocep, override+mlx5)
that together catch every plausible regression: missing setdefault,
missing f-string interpolation, wrong default value, and stale
hardcoded literal.
Repro:
.test_venv/bin/python -m unittest discover -s cvs/lib/unittests \
-p "test_megatron_training_lib.py" -v
# expect: 3 tests (1 from PR0 + 2 new), all "ok"
Integration: run `cvs run megatron_llama3_1_8b_distributed` with
`hca_id_pattern: "bnxt_|rocep"` in `mi3xx_megatron_llama_distributed.json`;
expect identical behavior to before.
|
The changes are good but want to address one things wrt regex: |
… of raw interpolation [AIMVT-160] Per @anujmittal-amd review on PR #157: the previous shape interpolated the config-supplied `hca_id_pattern` raw into `re.search`, putting the burden of regex correctness on the user. A typo like `mlx5+` (intending wildcard) silently false-matched `mlx5550000`; misconfigured segments went undetected because the failure surface was the misleading `fail_test('Broadcom libbnxt rdma driver is not properly copied on node ...')`, which points at the copy step rather than the config. Treat `hca_id_pattern` as a `|`-separated list of literal NIC-name prefixes. The lib splits on `|`, strips per-segment whitespace, escapes each segment with `re.escape`, and rejoins with `|` to build the verification regex. For the default value `bnxt_|rocep`: effective regex: `hca_id:\s+(bnxt_|rocep)` -- byte-identical to before. Backward-compatible by design: the `hca_id_pattern` key name and the default value are unchanged from `844a1d3`. Existing configs from PR #157 need no migration. The contract change is invisible to users supplying plain prefixes; only previously-broken configs (regex specials inside a segment) now fail loudly instead of false-matching. Changes: - `cvs/lib/megatron_training_lib.py`: replace the inline raw-interpolation `re.search` with a built-then-used regex inside `exec_nic_setup_scripts` (single 8-line block; key name + setdefault + attribute name unchanged). - `docs/reference/configuration-files/megatron.rst`: update the `hca_id_pattern` row description to spell out the contract -- "`|`-separated list of NIC-name prefixes ... each segment is treated as a literal prefix (regex special chars are escaped by the lib)". No JSON config change (defaults are byte-identical), no rename, no backward-compat shim. Out of scope (queued): - Same `re.escape` treatment for `nic_type` (also a regex-string config key, same input-error class). Reviewer didn't flag it; separate ticket. - Validating non-empty `hca_id_pattern` at __init__ with a helpful error. - Rebasing PRs #158 / #159 onto this commit (mechanical follow-up).
…escape [AIMVT-160]
Add two tests to `TestExecNicSetupHcaIdPattern` exercising the hardening
from the prior commit. The existing two tests (default-rocep and
override-mlx5) are unchanged -- the new behavior is byte-identical for
their inputs.
- `test_whitespace_around_segments_is_stripped`: `'bnxt_| rocep | mlx5_'`
with surrounding whitespace per segment must still match `rocep1s0f0`.
Catches a refactor that drops the .strip() call from the parser.
- `test_special_regex_chars_in_segment_are_escaped`: PROOF-OF-DETECTION
for the `re.escape` fix. Config `hca_id_pattern='mlx5+'` plus devinfo
`hca_id:\tmlx5550000`:
- With `re.escape`: `+` is treated as literal; emitted regex is
`mlx5\+`; no match -> `fail_test` IS called.
- Without `re.escape` (the prior raw-interpolation shape): `5+`
becomes a quantifier matching `550000` -> `fail_test` is NOT called
-> the libbnxt-copy verify silently passes on the wrong NIC.
Verified by temporarily reverting the parser to raw-interpolation:
`AssertionError: Expected 'fail_test' to have been called once. Called 0 times.`
Class docstring extended to note the new contract (segments are literal
prefixes, not regex fragments).
Repro:
python -m unittest cvs.lib.unittests.test_megatron_training_lib -v
# expect: 5 tests passing (1 smoke + 4 in TestExecNicSetupHcaIdPattern)
…ents [AIMVT-160] Closes the residual edge case from `bef8d29`'s safety pass. After the parse+escape pipeline, an empty `segments` list (from a config value of `""`, `"|||"`, `" "`, or any all-separator/whitespace input) was producing the degenerate regex `hca_id:\s+()` which silently matches every `ibv_devinfo` `hca_id:` line -- the libbnxt-copy verification would pass on the wrong NIC, exactly the "input user error" failure class the surrounding work set out to eliminate. Add an inline guard between the segment parse and the regex build: if zero non-empty segments, abort with `fail_test` and the offending raw value in the message so the user can find it quickly. Validation lives at-use, not at `__init__`. `hca_id_pattern` is only consumed on the distributed-training + Broadcom/Thor code path, so init-time validation would false-fail on single-node or Mellanox configs that legitimately ignore the key. Changes: - `cvs/lib/megatron_training_lib.py`: 5-line guard added inside `exec_nic_setup_scripts` between the `segments` list comprehension and the `hca_id_regex` build.
…ion [AIMVT-160] Add `test_empty_pattern_aborts_with_fail_test` to `TestExecNicSetupHcaIdPattern`: config `hca_id_pattern=''` must trigger the inline guard from the prior commit and call `fail_test`. PROOF-OF-DETECTION verified by removing the guard from `exec_nic_setup_scripts`: AssertionError: Expected 'fail_test' to have been called once. Called 0 times. Without the guard, the empty input parses to zero segments, the join emits `hca_id:\s+()`, the empty-capture regex matches every devinfo line, and the for-loop's `if not re.search(...)` is never True -- silently passing the libbnxt-copy verification on whatever NIC happens to be present. Repro: python -m unittest cvs.lib.unittests.test_megatron_training_lib -v # expect: 6 tests passing (1 smoke + 5 in TestExecNicSetupHcaIdPattern)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes [AIMVT-160]. Makes the in-container HCA-id verification regex inside
exec_nic_setup_scriptsconfigurable via a new optionalhca_id_patternkey in the training config (defaultbnxt_|rocep). Replaces the previously hardcodedbnxt_literal so users with Mellanox/RoCE or other RDMA NICs can extend the match without patching the lib.Why
exec_nic_setup_scriptsruns a Broadcom-specific libbnxt copy inside each container, then verifies the copy byre.search(r'hca_id:\s+bnxt_', ibv_devinfo_output). On clusters running different nics, that regex never matches. Today the only fix is editing the lib; this PR moves the pattern into config.What changed
cvs/lib/megatron_training_lib.pyAdd
tdict.setdefault('hca_id_pattern', 'bnxt_|rocep')next to the existingnic_typesetdefault.Store on
self.hca_id_patternand thread it into there.searchcall:cvs/input/config_file/training/megatron/mi3xx_megatron_llama_distributed.json— add_example_hca_id_pattern: "bnxt_|rocep|mlx5_"andhca_id_pattern: "bnxt_|rocep"after thenccl_ib_gid_indexblock. Distributed only — the libbnxt workaround never runs in single-node mode, so the single-node JSONs intentionally don't get the key.docs/reference/configuration-files/megatron.rst— distributed section only:configparameters table.cvs/lib/unittests/test_megatron_training_lib.py— appendTestExecNicSetupHcaIdPatternwith two cases:test_default_pattern_matches_rocep— default config +hca_id:\trocep1s0f0, assertsfail_testis not called. Catches a regression where the default value losesrocep.test_override_pattern_matches_mlx5—hca_id_pattern='mlx5_'+hca_id:\tmlx5_0, assertsfail_testis not called. Catches missingsetdefault, missing f-string interpolation, and a stale hardcoded literal.