Frequent fs inconsistencies on trident pvcs, errors surfacing on resize operations #656
Actually, this: kubernetes/kubernetes#78987 seems to be close to our issue.
To add a little more to this one: even though we are looking at our stack to debug iscsi and multipath and profile the circumstances under which our volumes can get corrupted, trident is the one responsible for running the filesystem resize.
How trident handles that is by doing a temporary mount for the volume (under /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629/globalmount/tmp_mnt in my example above) and issuing the resize there. At this point, all we can do is follow this troubleshooting NetApp link: https://kb.netapp.com/Advice_and_Troubleshooting/Cloud_Services/Trident_Openshift/Resize_of_PVC_leaves_it_in_a_faulted_state. Trident keeps cycling through temp mount -> resize2fs -> fail -> umount -> sleep, so for us to make a safe e2fsck run we effectively have to race that loop.
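For anyone trying to inspect the same thing, a minimal sketch of how to locate the staging mount and the backing device on a worker node (the PVC UID below is the one from my example; the devices you see will differ per node):

```sh
# Find trident's global (staging) mount for the PVC, if it is currently mounted
findmnt -o TARGET,SOURCE | grep pvc-65a48b3d-d2e4-4342-bf43-32fafd85e629

# Identify the dm/multipath device backing the iSCSI LUN
lsblk
multipath -ll
```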
@ffilippopoulos we've discussed this issue and it seems like your problem determination is spot on so far. We are wondering about the workflow you are using that is causing the FS inconsistencies. Are you seeing the issue with PVs created from snapshots or clones (which use snapshots)? If you take a snapshot without quiescing the application, which is a typical use case, then the snapshot has a potential FS inconsistency. If you aren't seeing the issue with snapshots/clones, then is there some other likely workflow that is causing the inconsistency?
No, actually we are not doing any NetApp snapshots for our pvcs.
This is what we are still trying to figure out. We have a fairly loaded environment (1K PVCs, 3.5K pods) and we get occasional node failures, at which point I am not sure how iscsid is handling things and whether there is any possibility of leaving dirty filesystems behind.
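For what it's worth, the kind of post-failure check we run on a node to look for iSCSI session state and unclean filesystems is roughly the following (a sketch, not an exhaustive procedure):

```sh
# List active iSCSI sessions on the node
iscsiadm -m session

# Show multipath state for the LUNs backing the PVs
multipath -ll

# Look for ext4 errors or journal recoveries reported by the kernel
dmesg -T | grep -i 'ext4'
```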
Any news regarding this issue? We face the exact same issue in our environment. Also a fairly busy cluster with around 3.5k pods and 500 PVCs. Our backend is actually AWS FSx for NetApp.
We are still running into this with a more generic error message using the latest trident version:
We still need to follow manual steps to repair (scale down trident, unmount and run e2fsck). Is there any progress or anything planned regarding this issue?
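For reference, the manual repair we follow is roughly the sketch below. It assumes the Trident controller Deployment is named trident-csi in the trident namespace and that the volume sits on a multipath device; all of these names are placeholders and vary by install and node:

```sh
# Stop Trident from retrying the resize while the filesystem is repaired
kubectl -n trident scale deployment trident-csi --replicas=0

# On the worker node: make sure the volume is unmounted, then check/repair it
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-uid>/globalmount/tmp_mnt 2>/dev/null || true
e2fsck -f /dev/mapper/<multipath-device>

# Bring Trident back so it can retry the resize against a clean filesystem
kubectl -n trident scale deployment trident-csi --replicas=1
```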
Hi @ffilippopoulos, A change was included in Trident v22.04.0 with commit 952659 to run fsck prior to the volume being attached. When a SAN volume is resized, it is necessary to restart the Pod to complete the resize process. By doing this, fsck should be run against your ext volume. Can you turn on debug in your environment and verify that your most recent error is the same as the originally reported error?
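To illustrate the restart step, something along these lines forces the volume to be unstaged and re-attached so the fsck-on-attach behaviour can run; the workload and namespace names are placeholders, and log collection may differ depending on how Trident was installed:

```sh
# Restart the workload consuming the resized PVC so the volume is re-staged
kubectl -n <app-namespace> rollout restart statefulset <workload-using-the-pvc>

# Collect Trident logs while the volume is re-attached and resized
tridentctl logs -n trident
```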
@gnarl thanks for the response. I've already repaired the errors manually for now, but I will keep an eye on it and report back with debug logs on the next occurrence.
Hi @ffilippopoulos, Do you have an update on this GitHub issue?
@gnarl no, I haven't seen the issue in our clusters for a while now.
@ffilippopoulos, thanks for the update. I'll close the issue and it can be reopened if needed.
Describe the bug
We run a Kubernetes cluster of 50 worker nodes, scheduling ~1K PVCs backed by netapp/trident. Very frequently, when we are resizing volumes we see the following error from trident:
and on the respective host:
After following the suggested workaround here: https://kb.netapp.com/Advice_and_Troubleshooting/Cloud_Services/Trident_Openshift/Resize_of_PVC_leaves_it_in_a_faulted_state and running e2fsck, we see that trident is able to complete resizing:
This is definitely not ideal, as e2fsck should be run on unmounted volumes and we are effectively racing trident.
Environment
To Reproduce
The behaviour is not consistent, but occurs frequently on our volumes (we debug this on a weekly basis at least). Most of our PVCs are backing datastores (for example, the latest occurrence was on all 3 volumes of an etcd deployment).
Expected behavior
For one thing, we need pointers to debug further how we end up with all these fs inconsistencies on our trident mounts. Our cluster is fairly loaded and we observe worker node failures from time to time. Could this result in filesystems being shut down uncleanly? In any case, we consider a node failure a common thing that can happen in a Kubernetes cluster.
For another thing, this is definitely not ideal, as e2fsck should be run on unmounted volumes and we are effectively racing trident. While we need to understand what is causing this to our volumes, it seems like it should be trident's responsibility to run e2fsck when the volume is not mounted after failing to resize it, in order to guarantee that the operation will be safe.
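Expressed as the equivalent manual commands, the behaviour being asked for would look roughly like the sketch below; the device and mount paths are placeholders, and this is an illustration of the desired order of operations, not Trident's actual code path:

```sh
# After resize2fs fails with filesystem errors, repair before retrying:
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-uid>/globalmount/tmp_mnt

# Check and repair the filesystem while it is unmounted (-p: automatic safe repairs)
e2fsck -p /dev/mapper/<multipath-device>

# Re-mount and retry the (online) resize
mount /dev/mapper/<multipath-device> /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc-uid>/globalmount/tmp_mnt
resize2fs /dev/mapper/<multipath-device>
```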