Add RunAI scheduler support and enable NCCL tests submission #436
@amaslenn, I resolved all of your comments. Thanks. I retested after updating the code, and it works.
srivatsankrishnan left a comment
LGTM for the purpose of enabling RunAI for acceptance testing with CloudAI. It's a pretty long PR, and we need more time to discuss the design and architecture and how it affects other core CloudAI features such as DSE and reporting. Based on the time I had to review this PR, I don't think it affects DSE yet, because we don't have a use case for DSE with this scheduler.
Summary
This pull request aims to add RunAI support to CloudAI and enable submission of NCCL tests to RunAI via CloudAI.
I can submit an NCCL test to RunAI, monitor it, and detect job completion.
However, there is still much room for improvement.
Design Document: https://docs.google.com/document/d/1fqD_hBXmj0ikXX91iAVGvCLZk2tZdUKqK08TZdT7E0g/edit?tab=t.0#heading=h.p9oz8l1n9zma
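For reviewers who want a mental model of the submit, monitor, and detect-completion flow described above, here is a minimal sketch of a polling loop. It is not the code in this PR and does not use the RunAI or CloudAI APIs: `JobState`, `wait_for_completion`, and the `get_state` callback are hypothetical names chosen for illustration, and the state sequence at the bottom is a stand-in for a real scheduler's status responses.

```python
import time
from enum import Enum
from typing import Callable


class JobState(Enum):
    # Hypothetical job states; a real scheduler may report different ones.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL_STATES = {JobState.COMPLETED, JobState.FAILED}


def wait_for_completion(
    get_state: Callable[[], JobState],
    poll_interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> JobState:
    """Poll get_state() until a terminal state is seen or the timeout expires."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"job still {state.value} after {timeout_s}s")
        sleep(poll_interval_s)
        waited += poll_interval_s


# Stand-in for a scheduler: each status query returns the next state.
fake_states = iter([JobState.PENDING, JobState.RUNNING, JobState.COMPLETED])
final_state = wait_for_completion(lambda: next(fake_states), poll_interval_s=0.0)
print(final_state.name)
```

Passing the status query in as a callback keeps the loop independent of any particular scheduler client, which also makes the timeout path easy to exercise in a unit test.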
Test Plan
Additional Notes