fix: Flaky Slurm network issues #445

Merged
ko3n1g merged 3 commits into main from ko3n1g/fix/slurm-flaky-network-issues
Mar 10, 2026

@ko3n1g ko3n1g commented Mar 10, 2026

Lustre sometimes reports a temporary permission error on the job-id file, which causes the experiment to fail. This PR fixes that.

Signed-off-by: oliver könig <okoenig@nvidia.com>
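
The actual change to `slurm.py` is not shown in this conversation. As an illustration only, one common way to tolerate a transient permission error on a shared filesystem is to retry the read with a short backoff. The sketch below assumes that approach; `read_with_retry` and its parameters are hypothetical names and are not taken from nemo_run.

```python
import time
from pathlib import Path


def read_with_retry(path, attempts=5, delay=0.5, backoff=2.0, sleep=time.sleep):
    """Read a text file, retrying when a PermissionError is raised.

    Hypothetical helper for illustration; not the code merged in this PR.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return Path(path).read_text()
        except PermissionError:
            if attempt == attempts:
                # Out of retries: surface the original error to the caller.
                raise
            # Wait, then lengthen the wait before the next attempt.
            sleep(delay)
            delay *= backoff
```

Only `PermissionError` is retried here, so a missing file or any other failure still surfaces immediately. The `sleep` parameter is injectable so the waits can be replaced in tests.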
Comment thread nemo_run/run/torchx_backend/schedulers/slurm.py Fixed
@ko3n1g ko3n1g added the r0.8.0 Cherry-pick PR to the r0.8.0 release branch label Mar 10, 2026
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g merged commit e7a4a03 into main Mar 10, 2026
24 checks passed
svcnvidia-nemo-ci pushed a commit that referenced this pull request Mar 10, 2026
* fix: Flaky Slurm network issues

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Potential fix for code scanning alert no. 534: Illegal raise

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* format

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
ko3n1g added a commit that referenced this pull request Mar 10, 2026
* fix: Flaky Slurm network issues

* Potential fix for code scanning alert no. 534: Illegal raise

* format

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Labels

r0.8.0 Cherry-pick PR to the r0.8.0 release branch

3 participants