Unified Training scheduler incorrectly uses allocatable for free GPU resources #34

Open
Fizzbb opened this issue Dec 31, 2021 · 3 comments
Collaborator

Fizzbb commented Dec 31, 2021

With the NVIDIA GPU device plugin, a node's allocatable GPU count is a fixed number, i.e., the number of GPUs installed on the node. It does not change with current usage. Therefore, when the scheduler decides the target replica count (GPU number) for a unified job, the available GPUs need to be calculated by listing the GPU requests of existing jobs. We should also add a safe exit process (e.g., reduce the replica size after scheduling fails N times, drop the job after M failures) to avoid repeatedly creating and deleting pods that cannot be scheduled.
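A minimal sketch of the safe exit idea, to make the N and M thresholds concrete. The `RetryPolicy` type, its field names, and the shrink-by-one step are all hypothetical choices made for illustration; nothing here is taken from the actual scheduler code.

```go
package main

import "fmt"

// RetryPolicy is a hypothetical container for the two thresholds in the text:
// ShrinkAfter corresponds to N, DropAfter corresponds to M.
type RetryPolicy struct {
	ShrinkAfter int // shrink the job every ShrinkAfter failed attempts
	DropAfter   int // give up once this many attempts have failed
}

// Next returns the replica count to try next and whether the job should be
// kept. Shrinking by one replica per step is an assumption of this sketch.
func (p RetryPolicy) Next(replicas, failures int) (int, bool) {
	if failures >= p.DropAfter || replicas <= 0 {
		return 0, false // drop the job
	}
	if p.ShrinkAfter > 0 && failures > 0 && failures%p.ShrinkAfter == 0 && replicas > 1 {
		return replicas - 1, true // ask for fewer GPUs on the next attempt
	}
	return replicas, true // retry with the same size
}

func main() {
	policy := RetryPolicy{ShrinkAfter: 3, DropAfter: 10}
	replicas := 4
	for failures := 1; failures <= 10; failures++ {
		next, keep := policy.Next(replicas, failures)
		if !keep {
			fmt.Printf("failure %d: dropping job\n", failures)
			break
		}
		if next != replicas {
			fmt.Printf("failure %d: shrinking %d -> %d replicas\n", failures, replicas, next)
		}
		replicas = next
	}
}
```

With a policy like this the scheduler stops recreating pods for a job that can never fit, instead of looping forever at the original replica size.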

@Fizzbb Fizzbb self-assigned this Jan 4, 2022
Collaborator Author

Fizzbb commented Jan 26, 2022

When you describe a node, the output ends with an allocated resources section, which shows the correct number. We could call the describe node command and extract that field. After searching the describe code, it turns out that it is also implemented by scanning all the pods on the node. So we could just use the old method: scan all the running pods and sum their requested GPUs.
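The calculation can be sketched with plain Go types as below. The `Pod` struct and `FreeGPUs` function are hypothetical stand-ins for the real Kubernetes objects, and the sketch assumes the per-pod GPU request has already been extracted as an integer. It counts only pods in the Running phase, as described above.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Name       string
	NodeName   string
	Phase      string // e.g. "Running", "Pending", "Succeeded"
	GPURequest int64  // GPUs requested by the pod
}

// FreeGPUs subtracts the GPU requests of running pods on the node from the
// node's fixed allocatable count. The result is clamped at zero.
func FreeGPUs(allocatable int64, nodeName string, pods []Pod) int64 {
	var used int64
	for _, p := range pods {
		if p.NodeName != nodeName || p.Phase != "Running" {
			continue // pod is on another node or is not running
		}
		used += p.GPURequest
	}
	if used >= allocatable {
		return 0
	}
	return allocatable - used
}

func main() {
	pods := []Pod{
		{Name: "job-a-0", NodeName: "node-a", Phase: "Running", GPURequest: 1},
		{Name: "job-b-0", NodeName: "node-a", Phase: "Running", GPURequest: 2},
		{Name: "job-c-0", NodeName: "node-b", Phase: "Running", GPURequest: 4},
		{Name: "job-d-0", NodeName: "node-a", Phase: "Succeeded", GPURequest: 1},
	}
	fmt.Println("free GPUs on node-a:", FreeGPUs(4, "node-a", pods))
}
```

The scheduler would then cap the target replica count at this value rather than at the node's allocatable number.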

Collaborator Author

Fizzbb commented Feb 1, 2022

Merged with pull request #90.
An interesting thing about selecting the running pods on a given node is that you need to define the index with the manager first; the List MatchingFields option is not automatically available. Also, don't use a real field path as the index name, since it may shadow an existing one and cause irrelevant errors.
Kubebuilder's jobOwnerKey is a useful reference.
kubernetes-sigs/kubebuilder#1422

Collaborator Author

Fizzbb commented Feb 2, 2022

Adding one more reference, for Kubernetes field selectors: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/
Supported field selectors vary by Kubernetes resource type. All resource types support the metadata.name and metadata.namespace fields. Using unsupported field selectors produces an error.
