-
Notifications
You must be signed in to change notification settings - Fork 150
Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
+9
−1
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1786,3 +1786,11 @@ | |
| - "Launch args match the provided vllm serve command, including FP4 indexer cache, FULL_AND_PIECEWISE cudagraph config, and max-num-batched-tokens 2048" | ||
| - "1k1k uses --max-model-len 4096; 8k1k uses the workflow-provided benchmark context length" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1144 | ||
|
|
||
| - config-keys: | ||
| - dsv4-fp8-mi355x-sglang | ||
| description: | ||
| - "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh" | ||
| - "DSv4-Pro on MI355X exceeded the 3h cap (STEP CANCELLED DUE TO TIME LIMIT) due to ~30min MoE JIT compile plus slow torch-fallback kernels (SGLANG_HACK_FLASHMLA_BACKEND=torch et al.) from sgl-project/sglang#23608" | ||
| - "300 minutes matches the GH Actions outer timeout-minutes cap in benchmark-tmpl.yml" | ||
| - "Retriggering dsv4-fp8-mi355x-sglang" | ||
|
Check failure on line 1796 in perf-changelog.yaml
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 The new changelog entry for
dsv4-fp8-mi355x-sglangat perf-changelog.yaml:1790-1796 is missing the requiredpr-linkfield. The pydanticChangelogEntrymodel inutils/matrix_logic/validation.py:338-345declarespr_linkas required with no default and usesextra="forbid", soutils/process_changelog.pywill raise a ValidationError when this PR's changelog is processed, breaking the run-sweep CI this PR exists to retrigger. Fix by addingpr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1147after the description block.Extended reasoning...
What the bug is
The changelog entry added by this PR at
perf-changelog.yaml:1789-1796registers a performance-relevant change againstdsv4-fp8-mi355x-sglangbut omits thepr-linkfield. Every one of the other 204 entries in this file supplies apr-linkline, even when the PR number is not yet known (entries use placeholders likepull/XXX,pull/XXXX, orpull/TBD). Omitting it entirely is a first.Why this breaks validation
In
utils/matrix_logic/validation.py:338-345:pr_linkhas no default value, so pydantic treats it as a required field. Whenprocess_changelog.pycallsChangelogEntry.model_validate(entry_data)on the added entry, pydantic will raise:ValidationError: 1 validation error for ChangelogEntrypr-link: Field required [type=missing, loc=('pr-link',), ...]The code path that triggers it
.github/workflows/run-sweep.ymlinvokesutils/process_changelog.pyon every PR that touchesperf-changelog.yaml. That script callsget_added_lines(base_ref, head_ref, 'perf-changelog.yaml')and runsChangelogEntry.model_validate(entry_data)over every added entry. Since the new entry is exactly what was added, validation will fail on this PR.Impact
This is ironic because the whole point of this PR is to retrigger
dsv4-fp8-mi355x-sglangvia run-sweep — but run-sweep's changelog-validation step will fail before it can retrigger anything. The--time=300fix torunners/launch_mi355x-amds.shis correct and unaffected, but the retrigger won't land until the changelog entry is valid.Step-by-step proof
python3 -c "import yaml; from utils.matrix_logic.validation import ChangelogEntry; d = yaml.safe_load(open('perf-changelog.yaml')); ChangelogEntry.model_validate(d[-1])"1 validation error for ChangelogEntry / pr-link / Field required [type=missing, ...]..github/workflows/run-sweep.yml) runningutils/process_changelog.pyon the diff between base and head will hit the same error on this entry.How to fix
Append a
pr-linkline at the end of the new entry, matching the style of every other entry, pointing to this PR:If the PR number is uncertain at this stage, using a placeholder like
pull/XXXX(as other entries in the file do) would also satisfy the required-string constraint and let CI proceed.