How to configure kubeshare-config.yaml? #22

Closed
jungyh0218 opened this issue Nov 9, 2022 · 1 comment
Comments

@jungyh0218

Hello. I recently migrated from KubeShare 1.0 to KubeShare 2.0, and it seems that many major things have changed in version 2.0.
I am trying to use 2.0, but the pod always fails to be scheduled. I suspect the cause is a misconfigured kubeshare-config.yaml file.
When I run 'kubectl describe pod', I get the following error message:

Warning  FailedScheduling  9s    kubeshare-scheduler  0/3 nodes are available: 1 [Filter] Node gpu01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node gpu02 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node mgmt01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0.

I have a GPU cluster, and here is its physical structure:
[image: gpu]

And here is the config file I wrote. What is wrong with it?

cellTypes:
  T4-NODE:
    childCellType: "Tesla-T4"
    childCellNumber: 2

cells:
- cellType: T4-NODE
  cellChildren:
  - cellId: gpu01
  - cellId: gpu02
@justin0u0
Contributor

Hi @jungyh0218, I think the config file should be:

cellTypes:
  T4-NODE:
    childCellType: "Tesla-T4"
    childCellNumber: 2
    isNodeLevel: true

cells:
- cellType: T4-NODE
  cellChildren:
  - cellId: gpu01
  - cellId: gpu02

Also make sure that the childCellType name matches the GPU model name shown by the nvidia-smi -L command, with spaces converted to -.
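As a rough illustration of that naming rule, here is a minimal Python sketch that derives the name from one line of nvidia-smi -L output. It assumes each line looks like "GPU 0: Tesla T4 (UUID: GPU-...)"; the helper name cell_type_name and the placeholder UUID are hypothetical and not part of KubeShare.

```python
import re

# Assumed shape of one line of `nvidia-smi -L` output:
#   GPU 0: Tesla T4 (UUID: GPU-...)
_SMI_LINE = re.compile(r"^GPU \d+: (?P<model>.+?) \(UUID: .+\)$")


def cell_type_name(smi_line: str) -> str:
    """Return the GPU model name from one `nvidia-smi -L` line, spaces replaced by '-'.

    Hypothetical helper for illustration only; KubeShare does not ship it.
    """
    match = _SMI_LINE.match(smi_line.strip())
    if match is None:
        raise ValueError(f"unrecognized nvidia-smi -L line: {smi_line!r}")
    return match.group("model").replace(" ", "-")


# Placeholder UUID; a real one comes from the GPU itself.
print(cell_type_name("GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)"))
```

If the printed name differs from the childCellType value in kubeshare-config.yaml, that mismatch is worth checking first.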

Ref: https://github.com/NTHU-LSALAB/KubeShare/blob/master/doc/deploy.md
