[BUG]: aks-log-collector.sh creates a large ip_netns_commands.txt which leads to ephemeral-storage issues #4148 #4326

Open
axelgMS opened this issue Apr 22, 2024 · 1 comment


axelgMS commented Apr 22, 2024

What happened: see Azure/AKS#4148

Describe the bug
We have observed that ip_netns_commands.txt files (example location: /tmp/tmp.4FKbTfOrn4/collect) on our AKS cluster nodes sometimes grow to many GB. When a file reaches around 90 GB, the node starts having ephemeral-storage problems ("The node was low on resource: ephemeral-storage."), pods are evicted, and multiple other issues follow.

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# ls -lh ip_netns_commands.txt
-rw-r--r-- 1 root root 38G Mar 7 13:19 ip_netns_commands.txt
root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# fuser -v ip_netns_commands.txt
USER PID ACCESS COMMAND
/tmp/tmp.4FKbTfOrn4/collect/ip_netns_commands.txt:
root 83233 F.... ip
root 109814 F.... ip
root 560481 F.... ip
root 1133085 F.... ip
root 1133086 F.... ip
root 1133087 F.... ip
root 1134797 F.... ip
root 1737066 F.... ip
root 1737172 F.... ip
root 1737210 F.... ip
root 1737451 F.... ip

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# pstree -aps 83233
systemd,1
└─aks-log-collect,61814 /opt/azure/containers/aks-log-collector.sh
└─ip,83233 -all netns exec /bin/bash -x -c...
└─ip,109814 -all netns exec /bin/bash -x -c...
└─ip,1737066 -all netns exec /bin/bash -x -c...
└─ip,1737172 -all netns exec /bin/bash -x -c...
└─ip,1737210 -all netns exec /bin/bash -x -c...
└─ip,1737451 -all netns exec /bin/bash -x -c...
└─ip,560481 -all netns exec /bin/bash -x -c...
└─ip,1133085 -all netns exec /bin/bash -x -c...
└─bash,1147264 -x -c...
└─ss,1147282 -anoempiO --cgroup
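
The manual check above can be scripted when several nodes need inspecting. The following is a minimal sketch, not part of aks-log-collector.sh: the function name `find_large_netns_logs`, its arguments, and the 1 GiB default threshold are all assumptions chosen for illustration. It relies only on GNU `find` and `ls`.

```shell
#!/usr/bin/env bash
# Sketch: list ip_netns_commands.txt files larger than a threshold.
# find_large_netns_logs, its arguments, and the default threshold are
# hypothetical names/values, not taken from the collector script.

find_large_netns_logs() {
  local search_root="${1:-/tmp}"        # directory tree to scan
  local min_bytes="${2:-1073741824}"    # report files above this size (default 1 GiB)

  # "-size +Nc" matches files strictly larger than N bytes (GNU find).
  # "-exec ls -lh {} +" prints size and path; nothing runs if no file matches.
  find "$search_root" -type f -name 'ip_netns_commands.txt' \
    -size "+${min_bytes}c" -exec ls -lh {} + 2>/dev/null
}
```

Running `find_large_netns_logs /tmp` as root on a node lists any oversized files; `fuser -v` on a reported path then shows which processes still hold it open, as in the output above.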

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# head --lines 20 /opt/azure/containers/aks-log-collector.sh
#! /bin/bash

# AKS Log Collector

# This script collects information and logs that are useful to AKS engineering
# for support and uploads them to the Azure host via a private API. These log
# bundles are available to engineering when customers open a support case and
# are especially useful for troubleshooting failures of networking or
# kubernetes daemons.

# This script runs via a systemd unit and slice that limits it to low CPU
# priority and 128MB RAM, to avoid impacting other system functions.

# Log bundle upload max size is limited to 100MB
MAX_SIZE=104857600

# Shell options - remove non-matching globs, don't care about case, and use
# extended pattern matching
shopt -s nullglob nocaseglob extglob

AKS 1.28.5

UtheMan commented Apr 25, 2024

One way to stop the issue is to disable the log collector on the nodes; it is controlled by a systemd timer unit.

systemctl stop aks-log-collector.timer
systemctl disable aks-log-collector.timer

This would have to be run on every node. We are looking into a "long term" fix for this issue.
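
Because the workaround is per node, it may help to generate the commands for a list of nodes and review them before running anything. The sketch below only prints commands. It assumes, beyond what this thread states, that nodes are reachable with `kubectl debug node/...`, that the debug pod mounts the host filesystem at /host, and that the chosen image ships `chroot`. The function name `print_disable_commands` and the `DEBUG_IMAGE` default are hypothetical. `systemctl disable --now` combines the stop and disable steps shown above.

```shell
#!/usr/bin/env bash
# Sketch: print one workaround command per node, for review before running.
# Assumptions (not from the issue): kubectl debug access to nodes, host
# filesystem mounted at /host in the debug pod, and an image with chroot.

DEBUG_IMAGE="${DEBUG_IMAGE:-busybox}"   # hypothetical default; substitute your own image

print_disable_commands() {
  local node
  for node in "$@"; do
    # "disable --now" stops the timer and disables it in a single call
    printf 'kubectl debug node/%s --image=%s -- chroot /host systemctl disable --now aks-log-collector.timer\n' \
      "$node" "$DEBUG_IMAGE"
  done
}
```

For example, `print_disable_commands $(kubectl get nodes -o name | sed 's|^node/||')` prints one line per node in the cluster. Whether the `chroot /host systemctl` call succeeds from a debug pod depends on the cluster's configuration, so it is worth trying on a single node first.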
