Describe the bug
The csi-azuredisk-node DaemonSet has a container named azuredisk with a memory limit of 600Mi. A cluster admin cannot change this limit, because Azure immediately reverts any manual override. When a disk is reattached during node recreation (for example, during a Kubernetes upgrade), fsck runs on it. If fsck finds anything on the disk that needs attention, it starts a repair. For some very large disks, 600Mi is not enough, and the azuredisk container is OOM-killed. After the pod restarts, azuredisk runs fsck again, so the pod gets stuck in an "OOM crash loop".
Some events that were visible in the pod:
- Events from the pod that is waiting for the volume, which azuredisk never finishes preparing:
Events:
Warning FailedMount 3m52s kubelet MountVolume.MountDevice failed for volume "pvc-8a3f9a3c-36c7-437a-abd9-e462ae1e4b56" : rpc error: code = Aborted desc = An operation with the given Volume ID /subscriptions/HIDDEN_SUBSCRIPTION/resourceGroups/HIDDEN_RG/providers/Microsoft.Compute/disks/pvc-8a3f9a3c-36c7-437a-abd9-e462ae1e4b56 already exists.
Warning FailedMount 35s (x2 over 6m39s) kubelet MountVolume.MountDevice failed for volume "pvc-8a3f9a3c-36c7-437a-abd9-e462ae1e4b56" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
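To tell this OOM crash loop apart from other mount failures, one can check the container status of the affected csi-azuredisk-node pod. Below is a minimal Python sketch, assuming the pod object comes from kubectl get pod <pod> -n <namespace> -o json. The function name is_oom_crash_looping and the restart threshold of 3 are hypothetical choices for illustration, not part of the CSI driver.

```python
import json
from typing import Any

# Hypothetical threshold: how many restarts we treat as a "loop".
DEFAULT_MIN_RESTARTS = 3


def is_oom_crash_looping(
    pod: dict[str, Any],
    container_name: str = "azuredisk",
    min_restarts: int = DEFAULT_MIN_RESTARTS,
) -> bool:
    """Return True if the named container's last exit was OOMKilled and it
    has restarted at least `min_restarts` times.

    `pod` is a pod object as printed by `kubectl get pod ... -o json`.
    """
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        if status.get("name") != container_name:
            continue
        # lastState.terminated is only present if the container exited before.
        terminated = status.get("lastState", {}).get("terminated") or {}
        oom_killed = terminated.get("reason") == "OOMKilled"
        restarts = status.get("restartCount", 0)
        return oom_killed and restarts >= min_restarts
    # The container was not found in this pod.
    return False


if __name__ == "__main__":
    # Trimmed, made-up example of the relevant part of a pod's JSON.
    sample = json.loads("""
    {
      "status": {
        "containerStatuses": [
          {
            "name": "azuredisk",
            "restartCount": 7,
            "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
          }
        ]
      }
    }
    """)
    print(is_oom_crash_looping(sample))  # True
```

A repeated OOMKilled reason combined with a growing restart count matches the loop described in this issue; a single OOM kill on its own would not.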
To Reproduce
Steps to reproduce the behavior:
- Prepare a very large volume (12500 GiB) and cause some filesystem issue on it that fsck will try to repair.
- Recreate the node where the pod using the volume is running.
- The azuredisk container is OOM-killed after using more than its 600Mi memory limit.
Expected behavior
The volume is reattached without any issues.
Screenshots

Environment (please complete the following information):
- CLI version: not applicable
- Kubernetes version: v1.29.7
- CLI extension version: not applicable
- Browser: not applicable
Additional context
To resolve the problem, I used a workaround from here, but I consider it dangerous and would not want to use it again.