FailedMount: Unable to attach or mount volumes #483

@glennswest

Describe the bug
In an OpenShift 4.4.5 bare-metal install with Trident, we have seen several cases where a pod that is unscheduled from one worker and scheduled to another ends up in a state where the new pod cannot attach or mount the volume.

The current workaround is to restart the kubelets. This is a large-scale install, and the instability is becoming critical.

The logs look like the following:

time="2020-11-03T14:21:03Z" level=debug msg="GRPC call: /csi.v1.Node/NodeUnpublishVolume"
time="2020-11-03T14:21:03Z" level=debug msg="GRPC request: volume_id:"pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8" target_path:"/var/lib/kubelet/pods/0a61b028-214e-4074-8443-258f74b8b91b/volumes/kubernetes.io~csi/pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8/mount" "
time="2020-11-03T14:21:03Z" level=debug msg="Attempting to acquire shared lock (NodeUnpublishVolume-pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8)." lock=csi_node_server
time="2020-11-03T14:21:03Z" level=debug msg="Acquired shared lock (NodeUnpublishVolume-pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8)." lock=csi_node_server
time="2020-11-03T14:21:03Z" level=debug msg=">>>> NodeUnpublishVolume" Method=NodeUnpublishVolume Type=CSI_Node
time="2020-11-03T14:21:03Z" level=debug msg="<<<< NodeUnpublishVolume" Method=NodeUnpublishVolume Type=CSI_Node
time="2020-11-03T14:21:03Z" level=debug msg="Released shared lock (NodeUnpublishVolume-pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8)." lock=csi_node_server
time="2020-11-03T14:21:03Z" level=error msg="GRPC error: rpc error: code = Internal desc = could not check if the target path (/var/lib/kubelet/pods/0a61b028-214e-4074-8443-258f74b8b91b/volumes/kubernetes.io~csi/pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8/mount) is a directory; stat /var/lib/kubelet/pods/0a61b028-214e-4074-8443-258f74b8b91b/volumes/kubernetes.io~csi/pvc-9455721f-3a5e-49e1-9844-713bbcb3f5a8/mount: stale NFS file handle"
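
To see how many pods and volumes are affected on a node, the error lines can be picked out of the Trident node log. The following is only a rough sketch: the regular expression is an assumption based on the path layout in the excerpt above, and `find_stale_mounts` is a hypothetical helper, not part of Trident.

```python
import re

# Assumed pattern: the kubelet CSI mount path followed by the ESTALE message,
# as it appears at the end of the error line in the log excerpt.
STALE_RE = re.compile(
    r"/var/lib/kubelet/pods/(?P<pod_uid>[0-9a-f-]+)/volumes/"
    r"kubernetes\.io~csi/(?P<volume>pvc-[0-9a-f-]+)/mount: stale NFS file handle"
)


def find_stale_mounts(lines):
    """Return sorted, de-duplicated (pod_uid, volume) pairs from error lines."""
    hits = set()
    for line in lines:
        # Only error-level lines carry the failed stat result.
        if "level=error" not in line:
            continue
        match = STALE_RE.search(line)
        if match:
            hits.add((match.group("pod_uid"), match.group("volume")))
    return sorted(hits)
```

Passing it any iterable of log lines (for example an open log file) should give one entry per affected pod/volume pair, which makes it easier to tell whether the problem is limited to a single PVC or spread across the node.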

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Trident version: 20.07.01
  • Trident installation flags used: [e.g. -d -n trident --use-custom-yaml]
  • Container runtime: [e.g. Docker 19.03.1-CE]
  • Kubernetes version: [e.g. 1.15.1]
  • Kubernetes orchestrator: OpenShift 4.4.5
  • Kubernetes enabled feature gates: [e.g. CSINodeInfo]
  • OS: RHCOS
  • NetApp backend types: [e.g. CVS for AWS, ONTAP AFF 9.5, HCI 1.7]
  • Other:

To Reproduce
Steps to reproduce the behavior:
This appears to happen at random.
I'm currently running a specific test case, a container that switches between nodes, to make the problem easier to reproduce.
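
One way such a node-switching test could be scripted is sketched below. This is not the actual test case, only an illustration: it assumes the pod is owned by a controller (so it is recreated after deletion) and that `kubectl` is on the PATH. `bounce_commands` and `bounce` are hypothetical helper names.

```python
import subprocess


def bounce_commands(pod, namespace, node):
    """Build the kubectl calls that push `pod` off `node`."""
    return [
        ["kubectl", "cordon", node],                         # block rescheduling onto this node
        ["kubectl", "delete", "pod", pod, "-n", namespace],  # controller recreates it elsewhere
        ["kubectl", "uncordon", node],                       # make the node schedulable again
    ]


def bounce(pod, namespace, node):
    """Run the sequence, uncordoning the node even if the delete fails."""
    cordon, delete, uncordon = bounce_commands(pod, namespace, node)
    subprocess.run(cordon, check=True)
    try:
        subprocess.run(delete, check=True)
    finally:
        subprocess.run(uncordon, check=True)
```

Running `bounce` in a loop against whichever node currently hosts the pod, and checking the pod events for FailedMount after each move, should make the failure show up sooner than waiting for it to occur at random.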

Expected behavior
Pods should move between nodes without volume attach or mount failures.

Additional context
Add any other context about the problem here.
