
Can't attach volume when the trident DaemonSet can't connect to trident-csi for the first time. #283

Closed
inajob opened this issue Sep 3, 2019 · 8 comments

Comments

@inajob

inajob commented Sep 3, 2019

Trident 19.07.0 with CRDs can't attach a volume when the trident DaemonSet can't connect to trident-csi the first time.

I think the trident DaemonSet pod must crash (or retry forever) when it can't register itself.

The details are below.

(I have replaced the real node name with my-worker1 for security reasons.)


I found the following pods in my cluster, which mount Trident volumes:

$ kubectl get pods
NAME            READY   STATUS              RESTARTS   AGE
pod-pvc-block   0/1     ContainerCreating   0          25m
pod-pvc-file    0/1     ContainerCreating   0          58m

I checked Trident's logs, starting with the trident-csi controller pod.

$ kubectl logs -n trident trident-csi-cbfc4489f-t6mlf trident-main -f
time="2019-09-03T07:25:18Z" level=error msg="Node info not found." node=my-worker1
time="2019-09-03T07:25:18Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node my-worker1 was not found"

The log said the node was not found, but the node existed:

$ kubectl get nodes my-worker1
NAME                                           STATUS   ROLES    AGE    VERSION
my-worker1   Ready    <none>   168m   v1.15.1

I also found that there were no tridentnodes:

$ kubectl get tridentnodes.trident.netapp.io --all-namespaces
No resources found.

Then I checked the DaemonSet pod:

$ kubectl logs -n trident trident-csi-pz5w7 trident-main
time="2019-09-03T04:43:34Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Wed Jul 31 14:23:00 UTC 2019" version=19.07.0
time="2019-09-03T04:43:34Z" level=info msg="Initializing plain CSI helper frontend."
time="2019-09-03T04:43:34Z" level=info msg="Added frontend." name=plain_csi_helper
time="2019-09-03T04:43:34Z" level=info msg="Initializing CSI frontend." name=my-worker1 version=19.07.0
time="2019-09-03T04:43:34Z" level=info msg="Enabling node service capability." capability=STAGE_UNSTAGE_VOLUME
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_WRITER
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_READER_ONLY
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_READER_ONLY
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_SINGLE_WRITER
time="2019-09-03T04:43:34Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_MULTI_WRITER
time="2019-09-03T04:43:34Z" level=info msg="Added frontend." name=csi
time="2019-09-03T04:43:34Z" level=warning msg="Couldn't retrieve volume transaction logs: Unable to find key"
time="2019-09-03T04:43:34Z" level=info msg="Trident bootstrapped successfully."
time="2019-09-03T04:43:34Z" level=info msg="Activating plain CSI helper frontend."
time="2019-09-03T04:43:34Z" level=info msg="Activating CSI frontend."
time="2019-09-03T04:43:34Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix
time="2019-09-03T04:45:38Z" level=error msg="Could not communicate with controller after 90.00 seconds; could not log into the Trident CSI Controller: error communicating with Trident CSI Controller; Get https://trident-csi:34571/trident/v1/node: dial tcp 10.27.3.141:34571: connect: connection refused" node=my-worker1
time="2019-09-03T04:45:39Z" level=error msg="Error registering node my-worker1 with controller; could not log into the Trident CSI Controller: error communicating with Trident CSI Controller; Put https://trident-csi:34571/trident/v1/node/my-worker1: dial tcp 10.27.3.141:34571: connect: connection refused"

This was the cause of the trouble: the DaemonSet pod had given up registering with the controller after failing to reach it.

I deleted the pod:

$ kubectl delete pods -n trident trident-csi-pz5w7
pod "trident-csi-pz5w7" deleted

and checked that the re-created pod works well:

$ kubectl logs -n trident trident-csi-s4bwv trident-main
time="2019-09-03T07:49:25Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Wed Jul 31 14:23:00 UTC 2019" version=19.07.0
time="2019-09-03T07:49:25Z" level=info msg="Initializing plain CSI helper frontend."
time="2019-09-03T07:49:25Z" level=info msg="Added frontend." name=plain_csi_helper
time="2019-09-03T07:49:25Z" level=info msg="Initializing CSI frontend." name=my-worker1 version=19.07.0
time="2019-09-03T07:49:25Z" level=info msg="Enabling node service capability." capability=STAGE_UNSTAGE_VOLUME
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_WRITER
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_READER_ONLY
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_READER_ONLY
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_SINGLE_WRITER
time="2019-09-03T07:49:25Z" level=info msg="Enabling volume access mode." mode=MULTI_NODE_MULTI_WRITER
time="2019-09-03T07:49:25Z" level=info msg="Added frontend." name=csi
time="2019-09-03T07:49:25Z" level=warning msg="Couldn't retrieve volume transaction logs: Unable to find key"
time="2019-09-03T07:49:25Z" level=info msg="Trident bootstrapped successfully."
time="2019-09-03T07:49:25Z" level=info msg="Activating plain CSI helper frontend."
time="2019-09-03T07:49:25Z" level=info msg="Activating CSI frontend."
time="2019-09-03T07:49:25Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix

I checked the tridentnodes again:

$ kubectl get tridentnodes.trident.netapp.io --all-namespaces
NAMESPACE   NAME         AGE
trident     my-worker1   67s

Now my pods are working!

$ kubectl get pods  -w
NAME            READY   STATUS    RESTARTS   AGE
pod-pvc-block   1/1     Running   0          32m
pod-pvc-file    1/1     Running   0          65m

I think the trident DaemonSet pod must crash (or retry forever) when it can't register itself.

@scaleoutsean
Contributor

I saw this with Kubernetes 1.15.3 + Trident 19.07.0 last week.

@travisghansen

Does the fix also resolve scenarios where the connection is lost after some time, and not just on startup? I've now experienced both issues.

@clintonk
Contributor

@travisghansen Each CSI Trident node (i.e. each instance of the daemonset) must contact the Trident controller so that the controller knows the node info such as its name and iSCSI IQN. As long as that occurs at least once, it shouldn't matter if a node comes and goes any time after that.

@travisghansen

travisghansen commented Oct 16, 2019

@clintonk understood. Something else must be up then. I initially ran into this issue when the Trident pod was starting up. I just had another issue where a (non-Trident) pod never started because the Trident pod on the node lost its connection to the controller after running for several days. The pod was sitting in an init state indefinitely because the mount never happened for the PV. This was for NFS, but I get your message about iSCSI/IQN as well. So it must matter in some shape or form after the fact. I simply deleted the Trident pod on the node, and then everything moved on as expected.

I essentially want to make sure that it continually monitors and tries to connect over time.

@travisghansen

travisghansen commented Oct 16, 2019

Also note, if I understand correctly, the iSCSI/IQN step happens at attach time, so a connection to the controller would still need to function beyond startup of the pod. I may have misunderstood how/when that info gets added, though. That also assumes the node pods don't make those API calls but rather the controller... which may not be the case, not sure.

@clintonk
Contributor

@travisghansen Prior to this fix, if the CSI Trident node pod was restarted for whatever reason and could not contact the controller after the most recent restart, it would give up after 90 seconds and no longer respond to attachment requests, even if it had previously registered with the controller. The fix keeps retrying the registration indefinitely until it succeeds, and it does not block other node operations while doing so.

FWIW, while attaching a volume to a pod, a CSI provisioner is first called at the controller followed by the node. The controller must know the node details, because only the controller interacts with the storage management interfaces, and knowing (for example) the node IQN allows us to update igroups as needed at attach time.

@travisghansen

@clintonk "until it succeeds" is what I'm concerned about. What happens after it succeeds and then fails at some point? Will it start a reconnect sequence in that scenario as well (or exit the pod so a new one gets created, I suppose)?

@travisghansen

After a discussion on Slack with @clintonk it became clear I misread some logs. The connection after startup is irrelevant. Thanks for the help! Looking forward to the fix.

netapp-ci pushed a commit that referenced this issue on Apr 24, 2020: "brackets for datalif, required for mounting when specified by user"