Describe the bug
In an OpenShift 4.4.5 bare-metal install with Trident, we have seen several cases where a pod that is unscheduled from one worker and scheduled to another ends up in a state where the new pod cannot attach or mount the volume.
The current workaround is to restart the kubelets. This is a large-scale install, and the instability is becoming critical.
The logs look like the following:
time="2020-11-03T14:21:03Z" level=debug msg="GRPC call: /csi.v1.Node/NodeUnpublishVolume"
time="2020-11-03T14:21:03Z" level=debug msg="GRPC request: volume_id:"pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8" target_path:"/var/lib/kubelet/pods/0a61b028-214e-4074-8443-258f74b8b91b/volumes/kubernetes.io~csi/pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8/mount" "
time="2020-11-03T14:21:03Z" level=debug msg="Attempting to acquire shared lock (NodeUnpublishVolume-pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8)." lock=csi_node_server
time="2020-11-03T14:21:03Z" level=debug msg="Acquired shared lock (NodeUnpublishVolume-pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8)." lock=csi_node_server
time="2020-11-03T14:21:03Z" level=debug msg=">>>> NodeUnpublishVolume" Method=NodeUnpublishVolume Type=CSI_Node
time="2020-11-03T14:21:03Z" level=debug msg="<<<< NodeUnpublishVolume" Method=NodeUnpublishVolume Type=CSI_Node
time="2020-11-03T14:21:03Z" level=debug msg="Released shared lock (NodeUnpublishVolume-pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8)." lock=csi_node_server
time="2020-11-03T14:21:03Z" level=error msg="GRPC error: rpc error: code = Internal desc = could not check if the target path (/var/lib/kubelet/pods/0a61b028-214e-4074-8443-258f74b8b91b/volumes/kubernetes.io~csi/pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8/mount) is a directory; stat /var/lib/kubelet/pods/0a61b028-214e-4074-8443-258f74b8b91b/volumes/kubernetes.io~csi/pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8/mount: stale NFS file handle"
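Since this is a large install, it may help to see which volumes are affected across nodes. The following is only a rough sketch, not part of Trident: it assumes the node plugin logs have been saved or piped as plain text in the format shown above, and the function name `staleVolume` is made up for illustration. It counts error lines that mention a stale NFS file handle, grouped by PVC name.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// pvcPattern matches PVC-backed volume names such as
// pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8.
var pvcPattern = regexp.MustCompile(`pvc-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}`)

// staleVolume returns the first PVC name found in an error-level log line
// that reports a stale NFS file handle. It returns "" for any other line,
// and also when no PVC name can be found in a matching line.
func staleVolume(line string) string {
	if !strings.Contains(line, "level=error") ||
		!strings.Contains(line, "stale NFS file handle") {
		return ""
	}
	return pvcPattern.FindString(line)
}

func main() {
	// Read log lines from stdin and count stale-handle errors per volume.
	counts := map[string]int{}
	scanner := bufio.NewScanner(os.Stdin)
	// Log lines with two full kubelet paths can be long, so raise the limit.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		if vol := staleVolume(scanner.Text()); vol != "" {
			counts[vol]++
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	for vol, n := range counts {
		fmt.Printf("%s\t%d\n", vol, n)
	}
}
```

The program reads from stdin, so the logs of each node pod can be piped into it one node at a time to compare which workers report the error.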
Environment
Provide accurate information about the environment to help us reproduce the issue.
- Trident version: 20.07.01
- Trident installation flags used: (not provided)
- Container runtime: (not provided)
- Kubernetes version: (not provided)
- Kubernetes orchestrator: OpenShift 4.4.5
- Kubernetes enabled feature gates: (not provided)
- OS: RHCOS
- NetApp backend types: (not provided)
- Other: (not provided)
To Reproduce
Steps to reproduce the behavior:
This appears to happen at random.
I'm currently running a specific test case in which a container switches between nodes, to make the issue easier to reproduce.
Expected behavior
Pods should be able to move between nodes without any trouble attaching or mounting their volumes.
Additional context
Add any other context about the problem here.