cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410

Open
xuzimianxzm opened this issue Jun 2, 2023 · 7 comments

Comments

@xuzimianxzm

I think the following configuration has an issue: the field deviceListStrategy is an array, but the example provides a string. This causes a failure when the init container of nvidia-device-plugin-ctr starts.

cat << EOF > /tmp/dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF
@xuzimianxzm
Author

Another related issue is in the gpu-feature-discovery-init init container: it requires the deviceListStrategy field to be a string, not an array.

unable to load config: unable to finalize config: unable to parse config file: error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type string
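
The two type mismatches reported above can be reproduced in isolation with Go's standard encoding/json package. This is only a minimal sketch: listConfig and stringConfig are hypothetical stand-ins for the two consumers, not the actual structs from either project, and the input is plain JSON (the "while decoding JSON" part of the log suggests the YAML config is converted to JSON before decoding, but that is an assumption).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// listConfig is a hypothetical stand-in for a consumer that expects a list.
type listConfig struct {
	DeviceListStrategy []string `json:"deviceListStrategy"`
}

// stringConfig is a hypothetical stand-in for a consumer that expects a string.
type stringConfig struct {
	DeviceListStrategy string `json:"deviceListStrategy"`
}

// decodeAsList returns the error, if any, from decoding doc into listConfig.
func decodeAsList(doc string) error {
	var c listConfig
	return json.Unmarshal([]byte(doc), &c)
}

// decodeAsString returns the error, if any, from decoding doc into stringConfig.
func decodeAsString(doc string) error {
	var c stringConfig
	return json.Unmarshal([]byte(doc), &c)
}

func main() {
	scalar := `{"deviceListStrategy": "envvar"}`
	list := `{"deviceListStrategy": ["envvar"]}`

	// A scalar fails against the list-typed field, and a list fails
	// against the string-typed field; matching shapes decode cleanly.
	fmt.Println("scalar into []string:", decodeAsList(scalar))
	fmt.Println("list into string:", decodeAsString(list))
	fmt.Println("list into []string:", decodeAsList(list))
	fmt.Println("scalar into string:", decodeAsString(scalar))
}
```

With both consumers reading the same config file, no single value of deviceListStrategy satisfies both field types, which matches the pair of errors in this thread.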

@elezar
Member

elezar commented Jun 2, 2023

Thanks @xuzimianxzm. The deviceListStrategy config option was updated to be a list of strings late in the Device Plugin's v0.14.0 release cycle, and it seems the changes were never propagated to gpu-feature-discovery. This explains the error you're seeing in your second comment.

It does also seem as if we didn't implement a custom unmarshaller for the deviceListStrategy when extending this in the device plugin.

cc @cdesiniotis

Update: I have reproduced the failure in a unit test here https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/294 and we will work on getting a fix released.
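
For readers unfamiliar with the idea, a custom unmarshaller of the kind mentioned above could look roughly like the sketch below. It is not the code from the linked merge request: strategyList, pluginFlags, and parse are hypothetical names, and the only behavior assumed is that a single string should be treated as a one-element list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// strategyList is a hypothetical type that accepts either a JSON string
// or a JSON array of strings.
type strategyList []string

// UnmarshalJSON tries the list form first and falls back to a scalar string.
func (s *strategyList) UnmarshalJSON(b []byte) error {
	var list []string
	if err := json.Unmarshal(b, &list); err == nil {
		*s = list
		return nil
	}
	var single string
	if err := json.Unmarshal(b, &single); err != nil {
		return fmt.Errorf("deviceListStrategy must be a string or a list of strings: %w", err)
	}
	// Wrap the scalar so callers always see a slice.
	*s = strategyList{single}
	return nil
}

// pluginFlags is a hypothetical stand-in for the plugin's flags struct.
type pluginFlags struct {
	DeviceListStrategy strategyList `json:"deviceListStrategy"`
}

// parse decodes doc and returns the normalized strategy list.
func parse(doc string) ([]string, error) {
	var f pluginFlags
	if err := json.Unmarshal([]byte(doc), &f); err != nil {
		return nil, err
	}
	return f.DeviceListStrategy, nil
}

func main() {
	for _, doc := range []string{
		`{"deviceListStrategy": "envvar"}`,
		`{"deviceListStrategy": ["envvar", "volume-mounts"]}`,
	} {
		got, err := parse(doc)
		fmt.Println(got, err)
	}
}
```

Accepting both shapes on the reading side would let older configs that use a plain string keep working alongside configs that use the list form.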

@elezar
Member

elezar commented Jun 2, 2023

As a workaround, could you specify the deviceListStrategy using the DEVICE_LIST_STRATEGY envvar instead?

@ndacic

ndacic commented Jun 29, 2023

@elezar what do you mean? I am facing the same issue. I am deploying it as a DaemonSet with Flux, not using Helm. Should I create a DEVICE_LIST_STRATEGY env variable for the container, set its value to envvar, and exclude deviceListStrategy: "envvar" from the ConfigMap?

@alekc

alekc commented Jul 31, 2023

@ndacic this is how I solved it:

nodeSelector:
  nvidia.com/gpu.present: "true"
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: "none"
        failOnInitError: true
        nvidiaDriverRoot: "/"
        plugin:
          passDeviceSpecs: false
          deviceListStrategy:
            - envvar
          deviceIDStrategy: uuid
      sharing:
        timeSlicing:
          renameByDefault: false
          resources:
          - name: nvidia.com/gpu
            replicas: 10

@elezar
Member

elezar commented Aug 1, 2023

This issue should be addressed in the v0.14.1 release.

@ndacic please let me know if bumping the version does not address your issue so that I can better document the workaround.

@erikschul

@elezar This is still a problem with version 0.14.3.

It fails with the official example:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10

but it works with the example given above. Thanks @alekc
