chwrap does not search /usr/local/sbin for nvme binary on Talos Linux #1124

@dahui

Describe the bug

Trident's chwrap binary discovery does not search /usr/local/sbin when looking for the nvme binary on the host filesystem. On Talos Linux, the nvme-cli system extension installs the binary at /usr/local/sbin/nvme. When the Trident node pod starts, chwrap chroots into /host and searches standard paths (/usr/bin, /usr/sbin, /sbin, /bin) but does not check /usr/local/sbin or /usr/local/bin.

This causes NVMe discovery to fail at node initialization:

level=warning msg="Error discovering NVMe service on host." error="failed to get NVMe driver info; exit status 2"
level=info msg="NVMe is not active on this host."

The node is then registered as not NVMe-capable, and subsequent volume attach attempts fail with "hostNQN not found" because the controller received an empty NQN during publish.

/usr/local/sbin and /usr/local/bin are standard FHS directories for locally installed software. Immutable Linux distributions such as Talos Linux, Flatcar, and Bottlerocket commonly use these paths for extension-installed binaries because the base /usr/bin and /usr/sbin paths are read-only.

Environment

  • Trident version: 25.02 (Helm chart 100.2602.0)
  • Trident installation flags used: helm install trident netapp-trident/trident-operator --version 100.2602.0 --create-namespace --namespace trident-operator
  • Container runtime: containerd (Talos built-in)
  • Kubernetes version: 1.32.x
  • Kubernetes orchestrator: Talos Linux 1.12.5 managed via Sidero Omni
  • Kubernetes enabled feature gates: default
  • OS: Talos Linux 1.12.5 (immutable, API-driven, no shell access)
  • NetApp backend types: ONTAP AFF, ontap-san driver with sanType: nvme
  • Other: nvme-cli v2.14 installed via Talos system extension at /usr/local/sbin/nvme. /etc/nvme/hostnqn exists and is valid on all nodes. The NVMe/TCP kernel module is built into the Talos kernel.

To Reproduce

  1. Deploy Talos Linux 1.12.5 with the nvme-cli system extension
  2. Install Trident 25.02 via Helm
  3. Configure an ontap-san backend with sanType: nvme
  4. Create a StorageClass and PVC targeting the NVMe backend, plus a test pod that mounts the PVC
  5. Observe that the PVC remains unbound
  6. Check the Trident node pod logs for the "exit status 2" and "NVMe is not active on this host." messages
  7. Describe the test pod to see "hostNQN not found" on attach
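
The log check in step 6 can be sketched as a small script. This is only an illustration: the function name find_nvme_symptoms and the SYMPTOMS table are hypothetical, and matching on substrings of the two messages quoted above is an assumption about what is stable in the log output.

```python
# Messages quoted in this issue. Matching on these exact substrings is an
# assumption made for illustration, not a documented Trident log contract.
SYMPTOMS = {
    "driver_info_failed": "failed to get NVMe driver info; exit status 2",
    "nvme_inactive": "NVMe is not active on this host.",
}


def find_nvme_symptoms(log_text):
    """Return the names of the symptoms above that appear in log_text."""
    return [name for name, needle in SYMPTOMS.items() if needle in log_text]
```

Feed it the text of the Trident node pod logs, however you collect them. If both names come back, the node hit the failure described in this issue.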

Expected behavior

chwrap should search /usr/local/sbin and /usr/local/bin in addition to the standard paths when looking for host binaries. The nvme binary at /usr/local/sbin/nvme should be found, NVMe should be detected as active on the node, and volume provisioning should succeed.
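
The requested lookup can be sketched as follows. This is an illustration in Python, not Trident's chwrap source, which is not quoted in this issue. The helper name find_host_binary, the host_root parameter, and the position of the two /usr/local directories in the search order are all assumptions.

```python
import os

# Directories this issue reports chwrap searching today (inside the chroot).
CURRENT_DIRS = ["/usr/bin", "/usr/sbin", "/sbin", "/bin"]
# Directories this issue asks to add. Appending them last is an assumption.
REQUESTED_DIRS = ["/usr/local/sbin", "/usr/local/bin"]


def find_host_binary(name, host_root="/host", search_dirs=None):
    """Return the in-chroot path of an executable called name, or None.

    Hypothetical helper: it checks each directory under host_root in order
    and returns the first regular file that is executable.
    """
    if search_dirs is None:
        search_dirs = CURRENT_DIRS + REQUESTED_DIRS
    for directory in search_dirs:
        candidate = os.path.join(host_root, directory.lstrip("/"), name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            # Report the path as seen from inside the chroot.
            return os.path.join(directory, name)
    return None
```

On a host laid out like the Talos nodes described here, find_host_binary("nvme") would return /usr/local/sbin/nvme, while passing search_dirs=CURRENT_DIRS would return None, which corresponds to the failure reported above.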

Additional context

The Talos rootfs is immutable, so the binary cannot be copied or symlinked into /usr/bin or /usr/sbin at runtime. Talos's extension validator also prohibits extensions from placing files in /usr/bin, so the fix cannot come from the Talos extension side either. This was attempted and rejected by the validator with the error: path "/usr/bin/nvme" is not allowed in extensions.

A parallel issue (siderolabs/extensions#1017) has been filed with Sidero Labs (Talos) requesting a validator exception, but a fix in chwrap would be more broadly useful since it would cover any binary installed under /usr/local/ on any immutable distribution.

This was also discussed in siderolabs/talos#9879 where a NetApp solutions architect encountered the same root cause.
