Add VertexAIService.launch_graph_store_job #350

Merged
kmontemayor2-sc merged 20 commits into main from kmonte/launch-multipool-vai
Oct 8, 2025

Conversation

@kmontemayor2-sc
Collaborator

Scope of work done

Add ability to launch heterogeneous VAI clusters via new VertexAIService.launch_graph_store_job API

Where is the documentation for this feature?: In doc comments

Did you add automated tests or write a test plan? Added unit tests

Updated Changelog.md? NO

Ready for code review?: YES

@kmontemayor2-sc
Collaborator Author

/unit_test

@kmontemayor2-sc
Collaborator Author

/integration_tests

@kmontemayor2-sc
Collaborator Author

/e2e_test

@github-actions
Contributor

github-actions Bot commented Oct 1, 2025

GiGL Automation

@ 17:22:07UTC : 🔄 Unit Test started.

@ 17:28:21UTC : ❌ Workflow failed.
Please check the logs for more details.

@github-actions
Contributor

github-actions Bot commented Oct 1, 2025

GiGL Automation

@ 17:22:12UTC : 🔄 Integration Test started.

@ 18:31:11UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions Bot commented Oct 1, 2025

GiGL Automation

@ 17:22:21UTC : 🔄 E2E Test started.

@ 18:42:28UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Collaborator Author

/unit_test

@github-actions
Contributor

github-actions Bot commented Oct 1, 2025

GiGL Automation

@ 18:54:41UTC : 🔄 Unit Test started.

@ 19:46:25UTC : ✅ Workflow completed successfully.

Comment thread python/gigl/common/services/vertex_ai.py Outdated
Comment thread python/gigl/common/services/vertex_ai.py Outdated
Comment thread python/gigl/common/services/vertex_ai.py Outdated
Comment thread python/gigl/common/services/vertex_ai.py Outdated
@kmontemayor2-sc
Collaborator Author

/unit_test

@kmontemayor2-sc
Collaborator Author

/integration_test

@github-actions
Contributor

github-actions Bot commented Oct 3, 2025

GiGL Automation

@ 16:30:50UTC : 🔄 Unit Test started.

@ 17:23:00UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions Bot commented Oct 3, 2025

GiGL Automation

@ 16:30:58UTC : 🔄 Integration Test started.

@ 17:32:29UTC : ✅ Workflow completed successfully.

Collaborator

@mkolodner-sc mkolodner-sc left a comment

Thanks Kyle, generally LGTM. I would just prefer that we use the existing VAI environment variables that are auto-populated instead of creating our own here.

Comment thread python/gigl/common/services/vertex_ai.py Outdated
@kmontemayor2-sc
Collaborator Author

> Seeing as this lives in python/gigl/common/services/vertex_ai.py, I don't think we'd be kicking ourselves in the foot by relying on VAI environment variables in this file specifically.

The reliance isn't on this side, it's on the "application code" (e.g. whatever code we run in these jobs). If we rely on the VAI CLUSTER_SPEC, then any of that code is going to be harder to update in the future. We could expose this all through some utility, but I'm not really sure what the downside is to adding new env variables that we are in control of.
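
For illustration only (this is not code from this PR): a minimal sketch of the "expose this through some utility" option, so that application code never reads `CLUSTER_SPEC` directly. It assumes the JSON layout Vertex AI documents for custom training jobs (a `cluster` map of pool name to host list, and a `task` object with `type` and `index`). `ClusterInfo` and `cluster_info_from_env` are hypothetical names, not GiGL APIs.

```python
import json
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class ClusterInfo:
    """Hypothetical, VAI-agnostic view of where this replica sits in the job."""

    pool_name: str  # e.g. "workerpool0"
    replica_index: int  # index of this replica within its pool
    pool_sizes: Dict[str, int]  # number of replicas per pool


def cluster_info_from_env(env: Mapping[str, str]) -> ClusterInfo:
    """Parse CLUSTER_SPEC in one place so callers depend only on ClusterInfo."""
    spec = json.loads(env["CLUSTER_SPEC"])
    task = spec["task"]
    # Each pool maps to a list of "host:port" strings; its length is the pool size.
    pool_sizes = {name: len(hosts) for name, hosts in spec["cluster"].items()}
    return ClusterInfo(
        pool_name=task["type"],
        replica_index=int(task["index"]),
        pool_sizes=pool_sizes,
    )
```

Inside a job this would be called as `cluster_info_from_env(os.environ)`. The trade-off discussed above still applies: a wrapper like this hides the VAI format from application code, while custom env variables avoid depending on it at all.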

Collaborator

@mkolodner-sc mkolodner-sc left a comment

Gotcha, thanks for the clarification. I don't feel too strongly here as that makes sense to me as well -- will defer to @svij-sc here. Approving to unblock on my end

Comment thread python/gigl/common/services/vertex_ai.py Outdated
@kmontemayor2-sc
Collaborator Author

/unit_test

@kmontemayor2-sc
Collaborator Author

/integration_test

@github-actions
Contributor

github-actions Bot commented Oct 6, 2025

GiGL Automation

@ 20:50:34UTC : 🔄 Unit Test started.

@ 21:41:16UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions Bot commented Oct 6, 2025

GiGL Automation

@ 20:50:39UTC : 🔄 Integration Test started.

@ 20:56:51UTC : ✅ Workflow completed successfully.

Comment thread python/gigl/common/services/vertex_ai.py Outdated
Comment thread python/gigl/common/services/vertex_ai.py Outdated
Comment thread python/gigl/common/services/vertex_ai.py Outdated
Comment thread python/tests/integration/common/services/vertex_ai_test.py
Collaborator

@svij-sc svij-sc left a comment

Sorry for another round of comments, but these seem like minimal changes, although the test cycles for worker pool 2 / switching around the compute and storage clusters might take a little time.

I will go ahead and pre-emptively approve, given that the rest of the comments will be addressed pre-merge.

@kmontemayor2-sc
Collaborator Author

> Could we try keeping worker pool indices 0 and 1 as the main training/inference processes,
> with either 2 or 3 being the storage pools? (Ideally 2; the graph store is already the “server” lane and maps cleanly to the paradigms of Parameter Servers / Reduction Server.)

Sure, we can put the compute pool first.

For whatever reason, VAI rejects it when I put anything (gpu or not) into worker pool 2. idk why.
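
A rough sketch of the layout being discussed, for illustration only and not the code in this PR: compute in worker pools 0 and 1, storage in worker pool 2. It assumes the usual Vertex AI custom job shape for worker pool specs (`machine_spec`, `replica_count`, `container_spec`), that pool 0 holds exactly one replica, and that an unused pool is represented by an empty dict. `build_worker_pool_specs` and its parameters are hypothetical names.

```python
from typing import Any, Dict, List


def build_worker_pool_specs(
    image_uri: str,
    compute_machine_type: str,
    num_compute_replicas: int,
    storage_machine_type: str,
    num_storage_replicas: int,
) -> List[Dict[str, Any]]:
    """Lay out compute replicas in pools 0 and 1 and storage replicas in pool 2."""
    if num_compute_replicas < 1 or num_storage_replicas < 1:
        raise ValueError("need at least one compute and one storage replica")

    def pool(machine_type: str, replica_count: int) -> Dict[str, Any]:
        return {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": replica_count,
            "container_spec": {"image_uri": image_uri},
        }

    # Pool 0: a single "primary" compute replica.
    primary = pool(compute_machine_type, 1)
    # Pool 1: the remaining compute replicas, or an empty placeholder if none
    # (assumption: an empty dict keeps pool 2 at index 2).
    extra = num_compute_replicas - 1
    workers = pool(compute_machine_type, extra) if extra > 0 else {}
    # Pool 2: the storage (graph store) replicas.
    storage = pool(storage_machine_type, num_storage_replicas)
    return [primary, workers, storage]
```

For example, `build_worker_pool_specs("<image>", "n1-standard-16", 4, "n2-highmem-32", 2)` would yield one primary compute replica, three further compute replicas, and two storage replicas; the machine types here are placeholders.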

@kmontemayor2-sc
Collaborator Author

> For whatever reason, VAI rejects it when I put anything (gpu or not) into worker pool 2. idk why.

oh lol it randomly works now...

@kmontemayor2-sc kmontemayor2-sc marked this pull request as ready for review October 8, 2025 16:46
@kmontemayor2-sc kmontemayor2-sc added this pull request to the merge queue Oct 8, 2025
Merged via the queue into main with commit dc24ca3 Oct 8, 2025
4 checks passed
@kmontemayor2-sc kmontemayor2-sc deleted the kmonte/launch-multipool-vai branch October 8, 2025 20:10

4 participants