Conversation

@skirdey-inflection
Contributor

What does this PR do?

When running on Slurm, hostname resolution sometimes fails. This PR adds a few fallback methods and debugging info to make ray.sub more robust.
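
The description does not list the individual fallbacks, so the following is only an illustrative sketch of the general pattern (try several resolution methods in order and log each attempt), not the code in this PR. ray.sub itself is a shell script; the function names `via_gethostbyname`, `via_getaddrinfo`, and `resolve_with_fallbacks` are hypothetical and use only the Python standard library.

```python
import socket
import sys
from typing import Callable, Iterable, Optional

# A resolver takes a hostname and returns an IPv4 string, or None on failure.
Resolver = Callable[[str], Optional[str]]


def via_gethostbyname(host: str) -> Optional[str]:
    """First attempt: the simple IPv4 lookup."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror / socket.herror are OSError subclasses
        return None


def via_getaddrinfo(host: str) -> Optional[str]:
    """Second attempt: getaddrinfo restricted to IPv4."""
    try:
        infos = socket.getaddrinfo(host, None, family=socket.AF_INET)
    except OSError:
        return None
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return infos[0][4][0] if infos else None


def resolve_with_fallbacks(
    host: str,
    resolvers: Iterable[Resolver] = (via_gethostbyname, via_getaddrinfo),
) -> str:
    """Try each resolver in order, logging every attempt to stderr."""
    for resolver in resolvers:
        ip = resolver(host)
        name = getattr(resolver, "__name__", repr(resolver))
        print(f"[resolve] {name}({host!r}) -> {ip!r}", file=sys.stderr)
        if ip:
            return ip
    raise RuntimeError(f"could not resolve {host!r} with any fallback")
```

Passing the resolvers as a list keeps the order explicit and makes it easy to add a cluster-specific method (for example, one that shells out to a site tool) without touching the loop. The per-attempt log line shows which method succeeded or failed for a given node.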

@skirdey-inflection skirdey-inflection changed the title from "minor changes to ray-sub to improve robustness in execution on slurm" to "ray-sub - improve robustness" Aug 22, 2025
@skirdey-inflection skirdey-inflection changed the title from "ray-sub - improve robustness" to "chore: ray-sub - improve robustness" Aug 22, 2025
Contributor

@terrykong terrykong left a comment

thanks for making ray.sub better! lots of blood and sweat in this file :) left a comment

we'll test internally and then this should be good

also: @hemildesai to review and update nemo-run template

Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
Contributor

@terrykong terrykong left a comment

other than my awk command, i've tested and lgtm. if addressed we can merge

skirdey-inflection and others added 3 commits August 22, 2025 15:40
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Stan Kirdey <stan@inflection.ai>
@terrykong terrykong enabled auto-merge August 26, 2025 06:25
@terrykong terrykong added this pull request to the merge queue Aug 26, 2025
Merged via the queue into NVIDIA-NeMo:main with commit 65a7965 Aug 26, 2025
21 checks passed
@bogdansalyp bogdansalyp mentioned this pull request Aug 26, 2025
4 tasks
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
Signed-off-by: Stan Kirdey <stan@inflection.ai>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 28, 2025
Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
Signed-off-by: Stan Kirdey <stan@inflection.ai>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
skirdey-inflection added a commit to skirdey-inflection/RL that referenced this pull request Aug 30, 2025
Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
Signed-off-by: Stan Kirdey <stan@inflection.ai>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025
Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
Signed-off-by: Stan Kirdey <stan@inflection.ai>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
Signed-off-by: Stan Kirdey <stan@inflection.ai>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
3 participants