Frequent fs inconsistencies on trident pvcs, errors surfacing on resize operations #656

Closed
ffilippopoulos opened this issue Oct 14, 2021 · 11 comments


@ffilippopoulos

ffilippopoulos commented Oct 14, 2021

Describe the bug

We run a Kubernetes cluster of 50 worker nodes, scheduling ~1K PVCs backed by netapp/trident.

Very frequently, when we resize volumes, we see the following error from trident:

time="2021-10-14T09:25:14Z" level=debug msg=">>>> osutils.execCommand." args="[/dev/dm-2]" command=resize2fs requestID=dbfa3c45-f2a5-4adb-bdfc-5f5232f9f1e5 requestSource=CSI
time="2021-10-14T09:25:16Z" level=debug msg="<<<< osutils.execCommand." command=resize2fs error="exit status 1" output="resize2fs 1.45.5 (07-Jan-2020)\nresize2fs: Permission denied to resize filesystem\n Filesystem at /dev/dm-2 is mounted on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629/globalmount/tmp_mnt; on-line resizing required\nold_desc_blocks = 1, new_desc_blocks = 1" requestID=dbfa3c45-f2a5-4adb-bdfc-5f5232f9f1e5 requestSource=CSI
time="2021-10-14T09:25:16Z" level=error msg="Expanding filesystem failed; exit status 1" requestID=dbfa3c45-f2a5-4adb-bdfc-5f5232f9f1e5 requestSource=CSI

and on the respective host:

Oct 14 09:23:08 worker-33.dev.merit.uw.systems kernel: EXT4-fs (dm-2): warning: mounting fs with errors, running e2fsck is recommended
Oct 14 09:23:09 worker-33.dev.merit.uw.systems kernel: EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
Oct 14 09:23:09 worker-33.dev.merit.uw.systems multipath[503836]: 3600a098038314438492b5057512f7536: adding new path sdd
Oct 14 09:23:10 worker-33.dev.merit.uw.systems kernel: EXT4-fs warning (device dm-2): ext4_resize_begin:72: There are errors in the filesystem, so online resizing is not allowed
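For reference, a quick read-only way to confirm this state from the superblock (assuming /dev/dm-2 is the affected multipath device, as in the logs above):

# read the superblock header; an affected volume typically reports something like "clean with errors"
dumpe2fs -h /dev/dm-2 | grep -i 'filesystem state'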

After following the suggested workaround here: https://kb.netapp.com/Advice_and_Troubleshooting/Cloud_Services/Trident_Openshift/Resize_of_PVC_leaves_it_in_a_faulted_state and running e2fsck, we see that trident is able to complete the resize:

time="2021-10-14T09:35:38Z" level=debug msg=">>>> osutils.execCommand." args="[/dev/dm-2]" command=resize2fs requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI
time="2021-10-14T09:35:39Z" level=debug msg="<<<< osutils.execCommand." command=resize2fs error="<nil>" output="resize2fs 1.45.5 (07-Jan-2020)\nFilesystem at /dev/dm-2 is mounted on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629/globalmount/tmp_mnt; on-line resizing required\nold_desc_blocks = 1, new_desc_blocks = 1\nThe filesystem on /dev/dm-2 is now 1310720 (4k) blocks long.\n" requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI
ime="2021-10-14T09:35:39Z" level=debug msg=">>>> osutils_linux.getFilesystemSize" requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI
time="2021-10-14T09:35:39Z" level=debug msg=">>>> osutils_linux.GetFilesystemStats" requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI
time="2021-10-14T09:35:39Z" level=debug msg="Filesystem size information" avail=51783183 blocks=57425380 bsize=4096 free=54717654 path=/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629/globalmount/tmp_mnt requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI size=235214356480                                      
time="2021-10-14T09:35:39Z" level=debug msg="<<<< osutils_linux.GetFilesystemStats" requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI
time="2021-10-14T09:35:39Z" level=debug msg="<<<< osutils_linux.getFilesystemSize" requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI
time="2021-10-14T09:35:39Z" level=warning msg="Failed to expand filesystem; size=235214356480" requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI             
time="2021-10-14T09:35:39Z" level=debug msg="<<<< osutils.expandFilesystem" cmd=resize2fs cmdArguments=/dev/dm-2 requestID=dd35fa59-c38d-48d4-9f91-b5298075e7aa requestSource=CSI tmpMountPoint=/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629/globalmount/tmp_mnt      

This is definitely not ideal, as e2fsck should be run on unmounted volumes and we are effectively racing trident.

Environment

  • Trident version: 21.07.1
  • Trident installation flags used: tridentctl install -n sys-trident --generate-custom-yaml
  • Container runtime: containerd://1.5.4
  • Kubernetes version: v1.21.5
  • OS: Flatcar Container Linux by Kinvolk 2905.2.5 (Oklo)
  • Kernel: 5.10.69-flatcar
  • NetApp backend types: ONTAP 9.9.1
  • iSCSI tools: iscsid version 2.1.4, multipath-tools v0.8.5 (11/09, 2020)

To Reproduce
The behaviour is not consistent, but it occurs frequently on our volumes (we debug this at least weekly). Most of our PVCs back datastores (for example, the latest occurrence was on all 3 volumes of an etcd deployment).

Expected behavior
For one thing, we need pointers on how to debug further why we end up with all these fs inconsistencies on our trident mounts. Our cluster is fairly loaded and we observe worker node failures from time to time. Could these result in filesystems being shut down uncleanly? In any case, we consider node failures a common occurrence in a Kubernetes cluster.

For another, this is definitely not ideal, as e2fsck should be run on unmounted volumes and we are effectively racing trident. While we need to understand what is corrupting our volumes, it seems like it should be trident's responsibility to run e2fsck while the volume is unmounted after a failed resize, in order to guarantee that the operation is safe.

@ffilippopoulos ffilippopoulos changed the title Frequent fs inconsistencies on trident pvcs, errors surfacing on resize operations. Trident should handle e2fsck commands itself. Frequent fs inconsistencies on trident pvcs, errors surfacing on resize operations Oct 14, 2021
@ffilippopoulos
Author

Actually, this: kubernetes/kubernetes#78987 seems to be close to our issue.

@ffilippopoulos
Author

To add a little more to this one: even though we are looking at our stack to debug iSCSI and multipath and to profile the circumstances under which our volumes get corrupted, trident is the one responsible for running resize2fs, and that call fails, as you can see from the trident logs:

time="2021-10-14T09:25:14Z" level=debug msg=">>>> osutils.execCommand." args="[/dev/dm-2]" command=resize2fs requestID=dbfa3c45-f2a5-4adb-bdfc-5f5232f9f1e5 requestSource=CSI
time="2021-10-14T09:25:16Z" level=debug msg="<<<< osutils.execCommand." command=resize2fs error="exit status 1" output="resize2fs 1.45.5 (07-Jan-2020)\nresize2fs: Permission denied to resize filesystem\n Filesystem at /dev/dm-2 is mounted on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629/globalmount/tmp_mnt; on-line resizing required\nold_desc_blocks = 1, new_desc_blocks = 1" requestID=dbfa3c45-f2a5-4adb-bdfc-5f5232f9f1e5 requestSource=CSI
time="2021-10-14T09:25:16Z" level=error msg="Expanding filesystem failed; exit status 1" requestID=dbfa3c45-f2a5-4adb-bdfc-5f5232f9f1e5 requestSource=CSI

Trident handles this by creating a temporary mount for the volume (under /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629/globalmount/tmp_mnt in my example above) and issuing the resize2fs command there. Since the volume is corrupted, the resize command fails, as described above.
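In shell terms, the sequence trident goes through is roughly equivalent to the following sketch (the tmp_mnt path is taken from the logs above; <pv-name> and /dev/dm-2 stand in for the real PV and multipath device):

# temporary staging mount used only for the online resize
mount /dev/dm-2 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount/tmp_mnt
resize2fs /dev/dm-2    # fails with "Permission denied to resize filesystem" on a corrupted fs
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount/tmp_mnt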

At this point, all we can do is follow this troubleshooting NetApp link: https://kb.netapp.com/Advice_and_Troubleshooting/Cloud_Services/Trident_Openshift/Resize_of_PVC_leaves_it_in_a_faulted_state
Even though this works, we still need to run e2fsck safely. Because trident's call to resize2fs on the host comes back with an error, trident unmounts the temporary mount and backs off for some time before retrying the same operation. As a result, it effectively loops over and over doing:

temp mount -> resize2fs -> fail -> umount -> sleep

Now, to run e2fsck safely, we need to scale down trident, since we do not want to risk issuing e2fsck while the volume is temporarily mounted by trident for resizing. If we scale down trident, we have to do everything manually (discover the iSCSI target and run e2fsck), as sketched below.
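Concretely, the manual repair looks roughly like this (a sketch only; the device path is a placeholder and trident must be scaled down first so nothing re-mounts the volume mid-check):

iscsiadm -m session    # confirm the iSCSI session for the volume is still logged in
multipath -ll          # find the dm device backing the PV, e.g. /dev/dm-2
umount /dev/dm-2       # if it is still mounted anywhere
e2fsck -f /dev/dm-2    # offline filesystem check/repair
# scale trident back up and let it retry the resize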
So, we are wondering whether trident could look at the resize2fs output after a failure and decide whether it can attempt to run e2fsck on the volume that is already discovered on the node, so that we do not need to scale down trident or race it when fixing volumes. If e2fsck fails, that's fine; it would then be the user's responsibility to have a look and act. But we expect that in most cases simply running e2fsck will be enough to move forward.
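In pseudocode, the behaviour we are asking about would be something along these lines (a sketch of the proposal, not actual Trident code):

# after the online resize on the temporary mount fails:
if ! resize2fs "$device"; then
    umount "$tmp_mnt"
    if e2fsck -fp "$device"; then    # preen mode: apply only safe, automatic fixes
        mount "$device" "$tmp_mnt"
        resize2fs "$device"          # retry now that the filesystem is clean
    fi                               # otherwise back off and leave it to the operator
fi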
Also, since there are instructions for this in the NetApp knowledge base, I do not think it is strictly related to our environment; it looks like a case that other users run into as well.

@gnarl gnarl added the tracked label Oct 20, 2021
@gnarl
Contributor

gnarl commented Oct 20, 2021

@ffilippopoulos we've discussed this issue and it seems like your problem determination is spot on so far. We are wondering about the workflow you are using that is causing the FS inconsistencies.

Are you seeing the issue with PVs created from snapshots or clones (which use snapshots)? If you take a snapshot without quiescing the application, which is a typical use case, then the snapshot has a potential FS inconsistency.

If you aren't seeing the issue with snapshots/clones then is there some other likely workflow that is causing the inconsistency?

@ffilippopoulos
Author

Are you seeing the issue with PVs created from snapshots or clones (which use snapshots)?

No, actually we are not taking any NetApp snapshots of our PVCs.

If you aren't seeing the issue with snapshots/clones then is there some other likely workflow that is causing the inconsistency?

This is what we are still trying to figure out. We have a fairly loaded environment (1K PVCs, 3.5K pods) and we get occasional node failures, at which point I am not sure how iscsid handles things and whether it could leave dirty filesystems behind.
Apart from that, we have no indication of what might be causing inconsistencies on our volumes' filesystems (most of them back datastores like etcd and Kafka).
We will keep gathering information and see if we can draw any useful conclusions.

@paalkr

paalkr commented Jan 4, 2022

Any news regarding this issue? We face the exact same issue in our environment, also a fairly busy cluster with around 3.5k pods and 500 PVCs. Our backend is actually AWS FSx for NetApp.

@ffilippopoulos
Author

We are still running into this, now with a more generic error message, on the latest trident version:

time="2022-12-05T08:44:06Z" level=error msg="Expanding filesystem failed; exit status 1" requestID=9e44818e-1c5a-4ea9-9cec-7737a01c1d7d requestSource=CSI                                                   
time="2022-12-05T08:44:06Z" level=info msg="Device unmounted." mountpoint=/var/lib/kubelet/plugins/kubernetes.io/csi/csi.trident.netapp.io/f090a147a67ecb2a2dddc21217850cbc35342cec6021ff0cdae29a7705c45251/
globalmount/tmp_mnt requestID=9e44818e-1c5a-4ea9-9cec-7737a01c1d7d requestSource=CSI                                                                                                                        
time="2022-12-05T08:44:06Z" level=error msg="Unable to expand filesystem." device=/dev/dm-15 error="exit status 1" filesystemType=ext4 requestID=9e44818e-1c5a-4ea9-9cec-7737a01c1d7d requestSource=CSI     
time="2022-12-05T08:44:06Z" level=error msg="GRPC error: rpc error: code = Internal desc = exit status 1" requestID=9e44818e-1c5a-4ea9-9cec-7737a01c1d7d requestSource=CSI 

We still need to follow manual steps to repair (scale down trident, unmount and run e2fsck). Is there any progress or anything planned regarding this issue?

@gnarl
Contributor

gnarl commented Dec 6, 2022

Hi @ffilippopoulos,

A change was included in Trident v22.04.0 with commit 952659 to run fsck before the volume is attached. When a SAN volume is resized, it is necessary to restart the Pod to complete the resize process. By doing this, fsck should be run against your ext volume.
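For anyone hitting this, a minimal way to trigger that path is to restart the workload that uses the PVC so the volume is unstaged and staged again (names are placeholders):

kubectl rollout restart statefulset <name> -n <namespace>
# or simply delete the pod that mounts the PVC; on the next staging the ext
# filesystem should be checked with fsck before it is mounted, per the change above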

Can you turn on debug in your environment and verify that your most recent error is the same as the originally reported error?

@ffilippopoulos
Author

@gnarl thanks for the response. I've already repaired the errors manually for now, but I will keep an eye out and report back with debug logs on the next occurrence.

@gnarl
Contributor

gnarl commented Feb 12, 2023

Hi @ffilippopoulos,

Do you have an update on this GitHub issue?

@ffilippopoulos
Author

@gnarl no, I haven't seen the issue in our clusters for a while now.

@gnarl
Contributor

gnarl commented Feb 13, 2023

@ffilippopoulos, thanks for the update. I'll close the issue and it can be reopened if needed.

@gnarl gnarl closed this as completed Feb 13, 2023