[BUG] The vGPU allocatable count for a node isn't updated after cluster creation #5774

Open
noahgildersleeve opened this issue May 8, 2024 · 2 comments
Assignees
Labels
  • area/device-manager: PCI and other host devices passthrough
  • area/rancher: Rancher related issues
  • backport-needed/1.3.2
  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • reproduce/needed: Reminder to add a reproduce label and to remove this one
  • require/doc: Improvements or additions to documentation
  • severity/needed: Reminder to add a severity label and to remove this one
Milestone

Comments

@noahgildersleeve

Describe the bug

When you create an RKE2 cluster from an upstream Rancher cluster, the vGPU allocatable count in node.status.allocatable is not updated after the cluster is created.
To Reproduce
Steps to reproduce the behavior:

  1. Create 4 vGPU profiles on a node
  2. Deploy a 3-node RKE2 cluster with that vGPU profile
  3. Try to deploy another 3-node RKE2 cluster and check the allocatable count
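For the "check allocatable" step, a minimal sketch of reading the vGPU entries out of a Node object may help. It assumes the node has been dumped as JSON (for example with `kubectl get node <name> -o json`) and that vGPU resources are advertised under an `nvidia.com/` prefix. The resource name `nvidia.com/EXAMPLE_VGPU_PROFILE`, the sample node, and the helper `vgpu_allocatable` are hypothetical and not taken from this issue.

```python
import json


def vgpu_allocatable(node: dict, prefix: str = "nvidia.com/") -> dict[str, int]:
    """Return {resource_name: count} for allocatable resources under `prefix`.

    `prefix` is an assumption about how the vGPU resources are named.
    """
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(quantity)  # extended resources are whole-number quantities
        for name, quantity in allocatable.items()
        if name.startswith(prefix)
    }


# Hypothetical Node object, trimmed to the fields used here.
sample_node = json.loads("""
{
  "metadata": {"name": "node-1"},
  "status": {
    "allocatable": {
      "cpu": "32",
      "nvidia.com/EXAMPLE_VGPU_PROFILE": "4"
    }
  }
}
""")

print(vgpu_allocatable(sample_node))
```

Running this before and after step 2 and comparing the two results is one way to confirm whether the reported value changed.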

Expected behavior

The allocatable vGPU count should update
Support bundle

supportbundle_314d015f-bac6-449f-a676-14fd3732e576_2024-05-08T22-16-49Z.zip

Environment

  • Harvester ISO version: v1.3.0
  • Rancher version: v2.8-head
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 2 node DL360

Additional context
This is the Harvester-side bug for 10948.

Greenshot 2024-05-03 17 55 22

@noahgildersleeve noahgildersleeve added kind/bug Issues that are defects reported by users or that we know have reached a real release area/rancher Rancher related issues reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one area/pci-devices labels May 8, 2024
@bk201 bk201 added the require/doc Improvements or additions to documentation label May 9, 2024
@bk201 bk201 added this to the v1.4.0 milestone May 9, 2024
@ibrokethecloud
Contributor

We should improve the device plugin in the pcidevices controller so that it monitors resource usage: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#monitoring-device-plugin-resources
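As a rough illustration of what monitoring resource usage could involve, the sketch below counts the device IDs currently assigned to containers. It assumes the input is a plain-dict rendering of the kubelet PodResources `List` response described at the link above, using the proto field names (`containers`, `devices`, `resource_name`, `device_ids`). The gRPC call itself is left out, and the sample data, the resource name, and the helper `devices_in_use` are hypothetical.

```python
from collections import Counter


def devices_in_use(pod_resources: list[dict], prefix: str = "nvidia.com/") -> Counter:
    """Count assigned device IDs per resource name across all pods."""
    in_use: Counter = Counter()
    for pod in pod_resources:
        for container in pod.get("containers", []):
            for device in container.get("devices", []):
                name = device.get("resource_name", "")
                if name.startswith(prefix):
                    in_use[name] += len(device.get("device_ids", []))
    return in_use


# Hypothetical data: two pods hold one vGPU each, a third holds none.
sample_pod_resources = [
    {"name": "vm-a", "namespace": "default", "containers": [
        {"name": "compute", "devices": [
            {"resource_name": "nvidia.com/EXAMPLE_VGPU_PROFILE", "device_ids": ["dev-0"]},
        ]},
    ]},
    {"name": "vm-b", "namespace": "default", "containers": [
        {"name": "compute", "devices": [
            {"resource_name": "nvidia.com/EXAMPLE_VGPU_PROFILE", "device_ids": ["dev-1"]},
        ]},
    ]},
    {"name": "other", "namespace": "default", "containers": [
        {"name": "app", "devices": []},
    ]},
]

print(devices_in_use(sample_pod_resources))
```

Subtracting these counts from the devices the plugin advertises would give a per-node figure for what is still free.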

Additional changes may be needed in the Harvester API to federate plugin info from all nodes so that Rancher can use it.
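One possible shape for such a federation step is sketched below. It assumes each node reports its advertised device count and its in-use count per resource name, then sums them into a cluster-wide view. The report format, the resource name, the node names, and the helper `federate` are all hypothetical; this is not the Harvester API.

```python
def federate(per_node: dict[str, dict]) -> dict[str, dict[str, int]]:
    """Merge per-node reports into cluster-wide totals per resource name."""
    summary: dict[str, dict[str, int]] = {}
    for report in per_node.values():
        for resource, total in report["allocatable"].items():
            used = report.get("in_use", {}).get(resource, 0)
            entry = summary.setdefault(
                resource, {"allocatable": 0, "in_use": 0, "free": 0}
            )
            entry["allocatable"] += total
            entry["in_use"] += used
            # Clamp per node so one over-reported node cannot hide free devices elsewhere.
            entry["free"] += max(total - used, 0)
    return summary


# Hypothetical reports from a two-node cluster.
reports = {
    "node-1": {
        "allocatable": {"nvidia.com/EXAMPLE_VGPU_PROFILE": 4},
        "in_use": {"nvidia.com/EXAMPLE_VGPU_PROFILE": 3},
    },
    "node-2": {
        "allocatable": {"nvidia.com/EXAMPLE_VGPU_PROFILE": 4},
        "in_use": {},
    },
}

print(federate(reports))
```

A consumer such as Rancher could then compare the `free` figure against the number of vGPUs a new cluster requests before provisioning it.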

@harvesterhci-io-github-bot

added backport-needed/1.3.2 issue: #5777.

5 participants