@knave knave commented Mar 26, 2025

No description provided.

0x5457 commented Mar 26, 2025

I believe that normally the GPUPool should always be created before the GPUNode. If the GPUPool is deleted, the GPUNode should be deleted as well.
In what situation would the GPUPool list be empty while the GPUNode still exists?

@0x5457 0x5457 added the go Pull requests that update go code label Mar 26, 2025
@0x5457 0x5457 requested a review from Code2Life March 26, 2025 07:18
@0x5457 0x5457 force-pushed the fix/node-controller-issue branch from b7f2742 to d2ae796 Compare March 26, 2025 09:26

knave commented Mar 27, 2025

When the cluster is newly created, the execution order between the following two reconcilers is non-deterministic:
* Node Reconciler: responsible for creating GPUNode
* Cluster Reconciler: responsible for creating GPUPool

When the Node reconciler runs first and finds an empty GPUPoolList, it returns early without requeueing the request. As a result, the GPUNode is never created. I just fixed this problem.
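
To illustrate the race described above, here is a minimal sketch of the requeue pattern. It is not the code from this PR: `Result`, `reconcileNode`, `poolRetryDelay`, and the rule of assigning a node to the first pool are all hypothetical stand-ins, written against the standard library only. `Result` merely imitates the `RequeueAfter` field of controller-runtime's `reconcile.Result`.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a hypothetical stand-in that imitates the RequeueAfter field of
// controller-runtime's reconcile.Result, so the sketch needs no dependencies.
type Result struct {
	RequeueAfter time.Duration
}

// poolRetryDelay is an assumed delay; the PR does not state the real value.
const poolRetryDelay = 5 * time.Second

// reconcileNode is a hypothetical Node reconciler step. pools is the current
// GPUPool list, and created records which pool each GPUNode was created in.
func reconcileNode(nodeName string, pools []string, created map[string]string) Result {
	if len(pools) == 0 {
		// No GPUPool exists yet (the Cluster reconciler has not run).
		// Ask to be called again instead of returning an empty Result,
		// which would drop the request and leave the GPUNode uncreated.
		return Result{RequeueAfter: poolRetryDelay}
	}
	// Assumption for illustration only: attach the node to the first pool.
	created[nodeName] = pools[0]
	return Result{}
}

func main() {
	created := map[string]string{}
	var pools []string

	// First pass: the Node reconciler runs before any GPUPool exists.
	res := reconcileNode("node-a", pools, created)
	fmt.Println(res.RequeueAfter > 0, len(created)) // prints "true 0"

	// The Cluster reconciler creates a pool, then the requeued request runs.
	pools = append(pools, "pool-1")
	res = reconcileNode("node-a", pools, created)
	fmt.Println(res.RequeueAfter > 0, created["node-a"]) // prints "false pool-1"
}
```

With an empty result on the first pass, nothing would trigger the second call in this sketch, which mirrors the situation where the GPUNode was never created.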

@0x5457 0x5457 merged commit 2bb9225 into NexusGPU:main Mar 28, 2025
2 checks passed
@github-actions

🎉 This PR is included in version 1.24.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Labels

go Pull requests that update go code released
