
Can't attach volume when the trident DaemonSet can't connect to trident-csi for the first time. #283

Closed
inajob opened this issue Sep 3, 2019 · 8 comments

Comments

@inajob

inajob commented Sep 3, 2019

Trident 19.07.0 with CRDs can't attach a volume when the trident DaemonSet can't connect to trident-csi the first time.

I think the trident DaemonSet pod must crash (or retry forever) when it can't register itself.

The details are below.

(I have replaced the real node name with my-worker1 for security reasons.)


I found the following pods in my cluster, which mount Trident volumes:

$ kubectl get pods
NAME            READY   STATUS              RESTARTS   AGE
pod-pvc-block   0/1     ContainerCreating   0          25m
pod-pvc-file    0/1     ContainerCreating   0          58m

I checked Trident's logs, starting with the trident-csi controller pod.

$ kubectl logs -n trident trident-csi-cbfc4489f-t6mlf trident-main -f
time="2019-09-03T07:25:18Z" level=error msg="Node info not found." node=my-worker1
time="2019-09-03T07:25:18Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node my-worker1 was not found"

The log said the node was not found, but the node existed:

$ kubectl get nodes my-worker1
NAME                                           STATUS   ROLES    AGE    VERSION
my-worker1   Ready    <none>   168m   v1.15.1

I also found that there were no tridentnodes:

$ kubectl get tridentnodes.trident.netapp.io --all-namespaces
No resources found.

Then I checked the DaemonSet pod:

$ kubectl logs -n trident trident-csi-pz5w7 trident-main
time="2019-09-03T04:43:34Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Wed Jul 31 14:23:00 UTC 2019" version=19.07.0
time="2019-09-03T04:43:34Z" level=info msg="Initializing plain CSI helper frontend."
time="2019-09-03T04:43:34Z" level=info msg="Added frontend." name=plain_csi_helper
time="2019-09-03T04:43:34Z" level=info msg="Initializing CSI frontend." name=my-worker1 version=19.07.0
time="2019-09-03T04:43:34Z" level=info msg="Enabling node service capability." capability=STAGE_UNSTAGE_VOLUME
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_WRITER
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_READER_ONLY
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_READER_ONLY
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_SINGLE_WRITER
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_MULTI_WRITER
time="2019-09-03T04:43:34Z" level=info msg="Added frontend." name=csi
time="2019-09-03T04:43:34Z" level=warning msg="Couldn't retrieve volume transaction logs: Unable to find key"
time="2019-09-03T04:43:34Z" level=info msg="Trident bootstrapped successfully."
time="2019-09-03T04:43:34Z" level=info msg="Activating plain CSI helper frontend."
time="2019-09-03T04:43:34Z" level=info msg="Activating CSI frontend."
time="2019-09-03T04:43:34Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix
time="2019-09-03T04:45:38Z" level=error msg="Could not communicate with controller after 90.00 seconds; could not log into the Trident CSI Controller: error communicating with Trident CSI Controller; Get https://trident-csi:34571/trident/v1/node: dial tcp 10.27.3.141:34571: connect: connection refused" node=my-worker1
time="2019-09-03T04:45:39Z" level=error msg="Error registering node my-worker1 with controller; could not log into the Trident CSI Controller: error communicating with Trident CSI Controller; Put https://trident-csi:34571/trident/v1/node/my-worker1: dial tcp 10.27.3.141:34571: connect: connection refused"

This was the cause of the trouble: the DaemonSet pod had given up registering with the controller after failing to reach it.

I deleted the pod:

$ kubectl delete pods -n trident trident-csi-pz5w7
pod "trident-csi-pz5w7" deleted

and checked that the re-created pod works well:

$ kubectl logs -n trident trident-csi-s4bwv trident-main
time="2019-09-03T07:49:25Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Wed Jul 31 14:23:00 UTC 2019" version=19.07.0
time="2019-09-03T07:49:25Z" level=info msg="Initializing plain CSI helper frontend."
time="2019-09-03T07:49:25Z" level=info msg="Added frontend." name=plain_csi_helper
time="2019-09-03T07:49:25Z" level=info msg="Initializing CSI frontend." name=my-worker1 version=19.07.0
time="2019-09-03T07:49:25Z" level=info msg="Enabling node service capability." capability=STAGE_UNSTAGE_VOLUME
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_WRITER
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_READER_ONLY
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_READER_ONLY
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_SINGLE_WRITER
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_MULTI_WRITER
time="2019-09-03T07:49:25Z" level=info msg="Added frontend." name=csi
time="2019-09-03T07:49:25Z" level=warning msg="Couldn't retrieve volume transaction logs: Unable to find key"
time="2019-09-03T07:49:25Z" level=info msg="Trident bootstrapped successfully."
time="2019-09-03T07:49:25Z" level=info msg="Activating plain CSI helper frontend."
time="2019-09-03T07:49:25Z" level=info msg="Activating CSI frontend."
time="2019-09-03T07:49:25Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix

I checked the tridentnodes again:

$ kubectl get tridentnodes.trident.netapp.io --all-namespaces
NAMESPACE   NAME         AGE
trident     my-worker1   67s

Now my pods are working!

$ kubectl get pods  -w
NAME            READY   STATUS    RESTARTS   AGE
pod-pvc-block   1/1     Running   0          32m
pod-pvc-file    1/1     Running   0          65m

I think the trident DaemonSet pod must crash (or retry forever) when it can't register itself.

@scaleoutsean
Contributor

I saw this with Kubernetes 1.15.3 + Trident 19.07.0 last week.

@travisghansen

Does the fix also resolve scenarios where the connection is lost after some time, and not just on startup? I've now experienced both issues.

@clintonk
Contributor

@travisghansen Each CSI Trident node (i.e. each instance of the daemonset) must contact the Trident controller so that the controller knows the node info such as its name and iSCSI IQN. As long as that occurs at least once, it shouldn't matter if a node comes and goes any time after that.

@travisghansen

travisghansen commented Oct 16, 2019

@clintonk understood. Something else must be up then. I initially ran into this issue when the Trident pod was starting up. I just had another issue where a (non-Trident) pod never started because the Trident pod on the node lost its connection to the controller after running for several days. The pod was sitting in an init state indefinitely because the mount never happened for the PV. This was for NFS, but I get your message about iSCSI/IQN as well. So it must matter in some shape or form after the fact. I simply deleted the Trident pod on the node, and then everything moved on as expected.

I essentially want to make sure that it continually monitors and tries to connect over time.

@travisghansen

travisghansen commented Oct 16, 2019

Also note, if I understand correctly, the iSCSI/IQN step happens at attach time, so a connection to the controller would still need to function beyond startup of the pod. I may have misunderstood how/when that info gets added, though. That also assumes the node pods don't make those API calls but rather the controller... which may not be the case, not sure.

@clintonk
Contributor

@travisghansen Prior to this fix, if the CSI Trident node pod was restarted for whatever reason and could not contact the controller after the most recent restart, it would give up after 90 seconds and no longer respond to attachment requests, even if it had previously registered with the controller. The fix keeps retrying the registration indefinitely until it succeeds, and it does not block other node operations while doing so.

FWIW, while attaching a volume to a pod, a CSI provisioner is first called at the controller followed by the node. The controller must know the node details, because only the controller interacts with the storage management interfaces, and knowing (for example) the node IQN allows us to update igroups as needed at attach time.

@travisghansen

@clintonk "until it succeeds" is what I'm concerned about. What happens after it succeeds and then fails at some point? Will it start a reconnect sequence in that scenario as well (or exit the pod so a new one gets created, I suppose)?

@travisghansen

After a discussion on Slack with @clintonk it became clear I misread some logs. The connection after startup is irrelevant. Thanks for the help! Looking forward to the fix.

netapp-ci pushed a commit that referenced this issue on Apr 24, 2020: "brackets for datalif, required for mounting when specified by user"