[BUG] azuredisk container in csi-azuredisk-node is OOM killed during running fsck #4682

@mzylowski


Describe the bug
The csi-azuredisk-node daemonset has an azuredisk container whose limit is set to 600Mi. A cluster admin cannot change this limit, because Azure immediately reverts any manual override. When a disk is reattached during node recreation (for example, after a Kubernetes update), fsck runs against it. If fsck finds anything on the disk that needs attention, a repair procedure starts. For a huge disk, 600Mi is sometimes not enough, and the azuredisk container is OOM killed. After the whole pod restarts, azuredisk calls fsck again, so we are stuck in an "OOM crash loop".

Some events that were visible in the pod:

  • Events from the pod that is waiting for the volume, which azuredisk has not prepared:
Events:
  Warning  FailedMount  3m52s  kubelet            MountVolume.MountDevice failed for volume "pvc-8a3f9a3c-36c7-437a-abd9-e462ae1e4b56" : rpc error: code = Aborted desc = An operation with the given Volume ID /subscriptions/HIDDEN_SUBSCRIPTION/resourceGroups/HIDDEN_RG/providers/Microsoft.Compute/disks/pvc-8a3f9a3c-36c7-437a-abd9-e462ae1e4b56 already exists.
  Warning  FailedMount  35s (x2 over 6m39s)  kubelet            MountVolume.MountDevice failed for volume "pvc-8a3f9a3c-36c7-437a-abd9-e462ae1e4b56" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

To Reproduce
Steps to reproduce the behavior:

  1. Prepare a huge volume (12500 GiB) and cause some issues on the disk that fsck will try to repair.
  2. Recreate the node where the pod using the volume is running.
  3. The azuredisk container is OOM killed because it uses more than 600Mi.
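
One way to confirm step 3 is to look at the container statuses of the csi-azuredisk-node pod on the affected node. The sketch below is only an illustration: it assumes the pod object has been exported as JSON (for example with `kubectl get pod <pod-name> -o json` in the namespace where the daemonset runs), and the `sample_pod` data and the helper name `oom_killed_containers` are hypothetical. It relies on the standard Kubernetes `status.containerStatuses[].state` and `lastState` fields, where a terminated container reports `reason: OOMKilled`.

```python
def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose current or last termination reason is OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # A crash-looping container is usually "waiting" now, so the OOM kill
        # shows up under lastState; check both to be safe.
        for state_key in ("state", "lastState"):
            terminated = (status.get(state_key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


# Hypothetical, heavily trimmed pod object; the values are illustrative,
# not taken from the cluster described in this issue.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "azuredisk",
                "restartCount": 5,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "liveness-probe",
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

If azuredisk appears in the result after every restart, that matches the crash loop described above rather than a one-off kill.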

Expected behavior
The volume is re-attached without any issues.

Environment (please complete the following information):

  • CLI Version: not applicable
  • Kubernetes version: v1.29.7
  • CLI Extension version: not applicable
  • Browser: not applicable

Additional context
To resolve the problem, I used a workaround from here, but I consider it dangerous and not something I can do again.
