Pods get stuck in ContainerCreating state when iSCSI attaching fails #736

Closed
tksm opened this issue Jun 20, 2022 · 2 comments

tksm commented Jun 20, 2022

Describe the bug
Pods get stuck in the ContainerCreating state when iSCSI attachment fails, and they never recover automatically afterwards.

If I understand correctly, the iSCSI attachment is performed by the NodeStageVolume call, and the kubelet retries the call if it fails. However, the Trident log shows that NodeStageVolume returned success even though the iSCSI attachment failed, so it was never retried. The subsequent NodePublishVolume call then failed to mount the volume and returned an internal error.

I found that Trident ignores every error from utils.AttachISCSIVolume() other than auth errors. This behavior seems to have been introduced in 22.04.0, and I think it may be the cause of this issue.

// Perform the login/rescan/discovery/(optionally)format, mount & get the device back in the publish info
if err := utils.AttachISCSIVolume(ctx, req.VolumeContext["internalName"], "", publishInfo); err != nil {
	// Did we fail to log in?
	if utils.IsAuthError(err) {
		// Update CHAP info from the controller and try one more time
		Logc(ctx).Warn("iSCSI login failed; will retrieve CHAP credentials from Trident controller and try again.")
		if err = p.updateChapInfoFromController(ctx, req, publishInfo); err != nil {
			return nil, status.Error(codes.Internal, err.Error())
		}
		if err = utils.AttachISCSIVolume(ctx, req.VolumeContext["internalName"], "", publishInfo); err != nil {
			// Bail out no matter what as we've now tried with updated credentials
			return nil, status.Error(codes.Internal, err.Error())
		}
	}
}
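
In the snippet above, a non-auth error falls out of the outer `if` block without being returned, so the staging call reports success. The following is a minimal sketch of the control flow I would expect instead. It is not Trident's actual code or fix: `stageWithRetry`, `errAuth`, `errLoginFailed`, `attachFunc`, and `refreshFunc` are hypothetical stand-ins for the real attach call, auth check, and CHAP refresh, and a plain wrapped error stands in for the gRPC `status.Error(codes.Internal, ...)` return.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors; the real code classifies errors with utils.IsAuthError.
var (
	errAuth        = errors.New("iSCSI authentication failed")
	errLoginFailed = errors.New("iSCSI login failed")
)

// attachFunc is a hypothetical stand-in for the utils.AttachISCSIVolume call.
type attachFunc func() error

// refreshFunc is a hypothetical stand-in for updateChapInfoFromController.
type refreshFunc func() error

// stageWithRetry attaches once. On an auth error it refreshes the credentials
// and attaches one more time. Every remaining error, auth-related or not, is
// returned to the caller so that the caller (here, the kubelet) can retry.
func stageWithRetry(attach attachFunc, refresh refreshFunc) error {
	err := attach()
	if err == nil {
		return nil
	}
	if errors.Is(err, errAuth) {
		if rerr := refresh(); rerr != nil {
			return fmt.Errorf("failed to stage volume: %w", rerr)
		}
		if err = attach(); err == nil {
			return nil
		}
	}
	// Either a non-auth error or a failed second attempt: never swallow it.
	return fmt.Errorf("failed to stage volume: %w", err)
}
```

With this shape, an unreachable LIF surfaces as a failed staging call rather than a silent success, which is what lets the kubelet retry once connectivity returns.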

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Trident version: 22.04.0
  • Trident installation flags used: silenceAutosupport: true (Trident Operator)
  • Container runtime: containerd://1.4.13
  • Kubernetes version: v1.23.4
  • Kubernetes orchestrator: Kubernetes
  • Kubernetes enabled feature gates: none
  • OS: Ubuntu 20.04.4 LTS
  • NetApp backend types: ONTAP AFF 9.9.1P
  • Other:

To Reproduce
Steps to reproduce the behavior:

  1. Make one node unable to connect to the LIFs.
    • iptables -A OUTPUT -p tcp -d <LIF_ADDRESS> -j REJECT
  2. Create a StatefulSet that has an ontap-san volume on the node.
  3. The Pod gets stuck in ContainerCreating.
  4. Make the node able to connect to the LIFs again.
    • iptables -D OUTPUT -p tcp -d <LIF_ADDRESS> -j REJECT
  5. The Pod remains stuck in ContainerCreating.

Expected behavior

Trident should retry the iSCSI attachment when it fails.

Additional context

tksm added the bug label Jun 20, 2022
gnarl added the tracked label Jun 21, 2022

gnarl (Contributor) commented Jul 27, 2022

This issue is fixed by commit ee934c9, and the fix is included in the Trident 22.07 release.

gnarl closed this as completed Jul 27, 2022

tksm (Author) commented Aug 3, 2022

@gnarl Thank you for fixing this issue. I confirmed that the issue no longer reproduces on Trident v22.07.0. 👍

Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   SuccessfulAttachVolume  72s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-5cb4eb99-6d25-43ed-a495-91e3d51128c4"
  Warning  FailedMount             15s (x3 over 47s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-5cb4eb99-6d25-43ed-a495-91e3d51128c4" : rpc error: code = Internal desc = failed to stage volume: iSCSI login failed
# I confirmed that the pod mounts the volume on retry once the node can connect to the LIFs again.
  Normal   Pulling                 11s                kubelet                  Pulling image "nginx"
  Normal   Pulled                  10s                kubelet                  Successfully pulled image "nginx" in 1.517655829s
  Normal   Created                 10s                kubelet                  Created container nginx
  Normal   Started                 10s                kubelet                  Started container nginx
