chore: ray-sub - improve robustness #968
dff581f to 61ae2b1
terrykong
left a comment
thanks for making ray.sub better! lots of blood and sweat in this file :) left a comment
we'll test internally and then this should be good
also: @hemildesai to review and update nemo-run template
Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
0997d26 to 037bba2
terrykong
left a comment
other than my awk command, i've tested and lgtm. if addressed we can merge
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Stan Kirdey <stan@inflection.ai>
Signed-off-by: Stanislav Kirdey <stan@inflection.ai> Signed-off-by: Stan Kirdey <stan@inflection.ai> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Signed-off-by: Stanislav Kirdey <stan@inflection.ai> Signed-off-by: Stan Kirdey <stan@inflection.ai> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
Signed-off-by: Stanislav Kirdey <stan@inflection.ai> Signed-off-by: Stan Kirdey <stan@inflection.ai> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
What does this PR do?
When running on Slurm, hostname resolution sometimes fails. This PR adds a few fallback resolution methods and extra debugging output to ray.sub to improve robustness.
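
To illustrate the general idea of falling back between resolution methods, here is a minimal sketch. It is not the code from this PR: the function name `resolve_ip`, the order of the fallbacks, and the log prefix are all assumptions made for the example. It tries `getent` first, then `host`, and prints a debug line to stderr if both fail.

```shell
# Hypothetical helper (not taken from the PR): print the first IPv4 address
# found for a hostname, trying several resolvers in order.
resolve_ip() {
  local host="$1" ip=""

  # Input that already looks like an IPv4 literal (digits and dots only)
  # needs no resolution.
  case "$host" in
    ""|*[!0-9.]*) ;;
    *.*.*.*) echo "$host"; return 0 ;;
  esac

  # 1) getent goes through NSS (/etc/hosts, DNS, ...) as configured on the node.
  if command -v getent >/dev/null 2>&1; then
    ip=$(getent ahostsv4 "$host" 2>/dev/null | awk 'NR==1 {print $1}' || true)
  fi

  # 2) Fall back to a direct DNS query with `host`, if it is installed.
  if [ -z "$ip" ] && command -v host >/dev/null 2>&1; then
    ip=$(host "$host" 2>/dev/null | awk '/has address/ {print $NF; exit}' || true)
  fi

  # 3) Give up, but leave a message on stderr to help debugging.
  if [ -z "$ip" ]; then
    echo "[resolve_ip] could not resolve '$host' (tried getent, host)" >&2
    return 1
  fi
  echo "$ip"
}
```

A job script could then call something like `head_ip=$(resolve_ip "$head_node") || exit 1`, where `head_node` is a placeholder for however the script picks the head node's hostname. Failing loudly at this point makes a resolution problem visible in the job log instead of surfacing later as a connection error.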