How to configure kubeshare-config.yaml? #22

jungyh0218 · 2022-11-09T03:01:37Z

Hello. I recently migrated from KubeShare 1.0 to KubeShare 2.0, and it seems that many major things have changed in version 2.0.
I am trying to use 2.0, but the pod always fails to be scheduled. I guess the cause is a misconfigured kubeshare-config.yaml file.
When I run 'kubectl describe pod ', I get an error message like this:

Warning  FailedScheduling  9s    kubeshare-scheduler  0/3 nodes are available: 1 [Filter] Node gpu01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node gpu02 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node mgmt01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0.

I have a GPU cluster and here is the physical structure of the cluster.

And here is the config file I wrote. What is wrong with it?

cellTypes:
  T4-NODE:
    childCellType: "Tesla-T4"
    childCellNumber: 2

cells:
- cellType: T4-NODE
  cellChildren:
  - cellId: gpu01
  - cellId: gpu02

justin0u0 · 2022-11-23T13:30:50Z

Hi @jungyh0218, I think the config file should be:

cellTypes:
  T4-NODE:
    childCellType: "Tesla-T4"
    childCellNumber: 2
    isNodeLevel: true

cells:
- cellType: T4-NODE
  cellChildren:
  - cellId: gpu01
  - cellId: gpu02

Also make sure that the childCellType name matches the GPU model name shown by the nvidia-smi -L command, with spaces converted to -.
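The naming rule above can be sketched in a few lines of Python. This is only an illustration, not part of KubeShare: the helper name `cell_type_names` is hypothetical, and the sample text assumes the usual `GPU <index>: <model> (UUID: ...)` line format printed by nvidia-smi -L, with placeholder UUIDs.

```python
import re
from typing import List

# Assumed line format of `nvidia-smi -L`, e.g.
# "GPU 0: Tesla T4 (UUID: GPU-...)"
_GPU_LINE = re.compile(r"^GPU \d+: (?P<model>.+?) \(UUID: .+\)\s*$")


def cell_type_names(nvidia_smi_output: str) -> List[str]:
    """Return the distinct GPU model names with spaces replaced by '-'.

    Hypothetical helper: it only derives candidate childCellType names
    from text; it does not talk to KubeShare or to the GPU driver.
    """
    names: List[str] = []
    for line in nvidia_smi_output.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not describe a GPU
        name = match.group("model").strip().replace(" ", "-")
        if name not in names:
            names.append(name)
    return names


# Placeholder output for a node with two Tesla T4 cards (UUIDs are made up).
sample = (
    "GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
    "GPU 1: Tesla T4 (UUID: GPU-11111111-1111-1111-1111-111111111111)"
)
print(cell_type_names(sample))
```

For the sample above, the single resulting name is Tesla-T4, which is the value used for childCellType in the config. On a real node, the output of nvidia-smi -L could be passed to the same function.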

Ref: https://github.com/NTHU-LSALAB/KubeShare/blob/master/doc/deploy.md

jungyh0218 closed this as completed Dec 8, 2022

justin0u0 mentioned this issue May 2, 2023

About how to write the gpu topology #25

Closed
