We have encountered an issue during the upgrade of our environment that results in specific upgraded pods being unable to mount volumes intermittently (~50% of the time).
Describe the bug
After the software on a Kubernetes node is upgraded (with Trident going from 22.10 to 23.01), we encountered an issue whereby the underlying volume cannot be mounted.
This issue is encountered on a PV that is "multi-mounted" - that is, two separate pods running on the same node are accessing the same PV in filesystem mode.
Here's the specific message that is printed when the upgrade bug occurs:
requestID=f3c8025d-dad5-4cb3-84a1-ad0406182858 requestSource=CSI
time="2023-06-23T17:39:46Z" level=error msg="GRPC error: rpc error: code = Internal desc = unable to mount device; exit status 32"
We believe we have traced the underlying problem to a regression introduced by this change:
aa3e565
Environment
- Trident version: Going from 22.10 to 23.01.
- OS: Ubuntu
- NetApp backend types: OTS & ONTAP AFF
To Reproduce
Steps to reproduce the behavior:
- Set up a pair of pods that share an underlying PV. Ensure that both pods are bound to the same node.
- Upgrade Trident from 22.10 to 23.01
- Take a look at the Trident tracking information
- You'll note that there is a missing field for certain volumes:
root@node0:/var/lib/trident/tracking# cat pvc-7a90e5d5-3b73-45ec-a08c-9963fe04933c.json | jq
{
  "localhost": true,
  "fstype": "ext4",
  "sharedTarget": true,
  "LUKSEncryption": "false",
  "iscsiTargetPortal": "172.0.0.14",
  "iscsiPortals": [
    "172.0.0.5",
    "172.0.0.6",
    "172.0.0.7"
  ],
  "iscsiTargetIqn": "iqn.1992-08.com.netapp:sn.e90faeca0ff711eea04c005056acda88:vs.4",
  "iscsiLunNumber": 5,
  "iscsiInterface": "default",
  "iscsiIgroup": "node0-b3135d6a-2cbf-4383-abea-235403b560e8",
  "useCHAP": true,
  "iscsiUsername": "dude-initiator",
  "iscsiInitiatorSecret": "IAaPKlD6ygOf0AhC",
  "iscsiTargetUsername": "dude-iscsi-target",
  "iscsiTargetSecret": "ZiYqVKFGDN4ouieZ",
  "VolumeTrackingInfoPath": "",
  "stagingTargetPath": "/var/lib/kubelet/plugins/kubernetes.io/csi/csi.trident.netapp.io/8e2c5043cdde8eef0e3d303ef5eaacafa803b3671810b8e46bcb5e3e7fa12964/globalmount",
  "publishedTargetPaths": {
    "/var/lib/kubelet/pods/8a7775fe-14ae-4089-95b9-8e764cef43fc/volumes/kubernetes.io~csi/pvc-7a90e5d5-3b73-45ec-a08c-9963fe04933c/mount": {}
  }
}
- When we see this bug manifest, rawDevicePath is not populated.
- This results in an exit status 32 error on the next attempt to mount the underlying PV.
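To check a node for affected volumes without inspecting each file by hand, something like the following Python sketch could be run against the tracking directory. It assumes the tracking files are flat JSON objects like the one shown above and that the field is named rawDevicePath exactly as written in this report; the function name find_missing_raw_device_path is hypothetical and not part of Trident. Whether every volume type is expected to carry this field is also an assumption, so treat the output as a list of candidates rather than confirmed failures.

```python
import json
from pathlib import Path

# Directory taken from the shell prompt in this report; adjust if yours differs.
DEFAULT_TRACKING_DIR = "/var/lib/trident/tracking"


def find_missing_raw_device_path(tracking_dir):
    """Return names of tracking files whose rawDevicePath is absent or empty.

    Hypothetical helper: the field name comes from this report, not from a
    documented Trident schema.
    """
    missing = []
    for path in sorted(Path(tracking_dir).glob("*.json")):
        try:
            info = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            # Unreadable or malformed files are skipped, not reported.
            continue
        if not isinstance(info, dict):
            continue
        # Treat both a missing key and an empty value as "not populated".
        if not info.get("rawDevicePath"):
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    for name in find_missing_raw_device_path(DEFAULT_TRACKING_DIR):
        print(name)
```

Run on the node as a user that can read the tracking directory; each printed file name corresponds to a volume that may hit the mount error described here.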
Expected behavior
- On upgrade, rawDevicePath should be present.