feat: reduce node impact by aks-log-collector #8598
Merged
Pull request overview
This PR reduces node impact from aks-log-collector by (1) splitting “normal” vs “full” log collection, (2) gating expensive system introspection behind IMDS flags, and (3) adjusting when/why collection is triggered across VHD build vs node provisioning.
Changes:
- Make `aks-log-collector.sh` lighter by default (optional `waagent_full` + `sysinfo` flags; tail-truncate files >1MB; skip `.gz`).
- Delay and simplify timer behavior, and start the timer only after kubelet is healthy during provisioning.
- Remove background log upload on successful CSE completion (keep upload on failure).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `vhdbuilder/packer/pre-install-dependencies.sh` | Stops enabling the log collector during VHD build; adjusts systemd handling for the collector unit. |
| `parts/linux/cloud-init/artifacts/cse_start.sh` | Removes success-path background log upload; retains upload on failure. |
| `parts/linux/cloud-init/artifacts/cse_main.sh` | Enables/starts the log-collector timer after kubelet health check. |
| `parts/linux/cloud-init/artifacts/aks-log-collector.timer` | Changes initial scheduling (now delayed) and simplifies timer triggers. |
| `parts/linux/cloud-init/artifacts/aks-log-collector.sh` | Introduces normal vs full collection, sysinfo gating, and tail-truncation/skip logic to reduce IO/lock impact. |

    # Based on MANIFEST_FULL from Azure Linux Agent's log collector
    # https://github.com/Azure/WALinuxAgent/blob/master/azurelinuxagent/ga/logcollector_manifests.py
    if [ "$COLLECT_WAAGENT_FULL" = "true" ]; then

nit: it might be a good idea to convert the flag values to lower case before comparing them
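
One way to act on this nit, as a minimal sketch: normalize the flag value to lower case before the comparison. `normalize_flag` is a hypothetical helper and is not part of the PR; only the `COLLECT_WAAGENT_FULL` variable comes from the diff above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): lower-case a flag value so that
# "True", "TRUE" and "true" all compare equal.
normalize_flag() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Default to "false" when the variable is unset or empty.
COLLECT_WAAGENT_FULL=$(normalize_flag "${COLLECT_WAAGENT_FULL:-false}")

if [ "$COLLECT_WAAGENT_FULL" = "true" ]; then
    echo "full waagent collection enabled"
fi
```

Normalizing once, right after the value is read, keeps every later `[ ... = "true" ]` check unchanged.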
Comment on lines +589 to +590

    systemctlEnableAndStartNoBlock aks-log-collector.timer 30 || echo "Warning: Could not start aks-log-collector.timer"

Comment on lines +328 to +347

    fsize=$(stat --printf "%s" "$file")
    if [ "$fsize" -gt "$MAX_FILE_SIZE" ]; then
        # Preserve directory structure so zip entry has the original path
        truncdir="${file%/*}"
        mkdir -p ".${truncdir}"
        mkfifo ".${file}"
        tail -c "$MAX_FILE_SIZE" "$file" >".${file}" &
        tail_pid=$!
        zip -gDZ deflate --fifo "${ZIP}" ".${file}"
        wait "$tail_pid" 2>/dev/null
        rm -f ".${file}"
    else
        zip -g -DZ deflate -u "${ZIP}" "$file" -x '*.sock'
    fi

    FILE_SIZE=$(stat --printf "%s" "${ZIP}")
    if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then
        echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE; removing last log file and terminating adding more files."
        zip -d "${ZIP}" $file
        break

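
The tail-through-a-named-pipe pattern in the hunk above can be tried in isolation. The following is a sketch under stated assumptions: `tail_via_fifo` is a hypothetical function name, and `cat` stands in for the archiver, whereas the PR hands the pipe to `zip`.

```shell
#!/bin/sh
# Sketch: stream at most $2 trailing bytes of file $1 into $3 through a
# named pipe. `cat` is a stand-in reader; the PR uses zip as the reader.
tail_via_fifo() {
    src=$1; max=$2; dest=$3
    workdir=$(mktemp -d) || return 1
    fifo="$workdir/pipe"
    mkfifo "$fifo" || { rm -rf "$workdir"; return 1; }

    # The writer runs in the background; opening the pipe for writing
    # blocks until a reader opens the other end.
    tail -c "$max" "$src" >"$fifo" &
    tail_pid=$!

    # The reader drains the pipe, so no truncated copy is written to a
    # regular temporary file along the way.
    cat "$fifo" >"$dest"
    wait "$tail_pid"
    rm -rf "$workdir"
}
```

Calling `tail_via_fifo /var/log/syslog 1048576 /tmp/syslog.tail` would copy only the last 1 MiB of the log; the paths here are illustrative only.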
Comment on lines +343 to 347

    FILE_SIZE=$(stat --printf "%s" "${ZIP}")
    if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then
        echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE; stopping."
        break
    fi

    echo "Adding log files to zip archive with max file size: $MAX_FILE_SIZE bytes..."
    for file in "${GLOBS[@]}"; do
        # shellcheck disable=SC3010
        [[ "$file" == *.gz ]] && continue

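
The `.gz` skip above relies on `[[ ... ]]`, which is why the hunk disables shellcheck's SC3010. If POSIX `sh` portability mattered for this check, a `case` pattern match would do the same job. This is only a sketch: `is_compressed_log` is a hypothetical helper, the paths are illustrative, and the surrounding script uses arrays, so it is not POSIX `sh` as a whole.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the path names an already-compressed log.
is_compressed_log() {
    case "$1" in
        *.gz) return 0 ;;
        *)    return 1 ;;
    esac
}

# Illustrative paths only; the real script iterates over its glob list.
for file in /var/log/syslog /var/log/syslog.2.gz; do
    is_compressed_log "$file" && continue
    echo "would add: $file"
done
```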
Comment on lines +331 to +334

    truncdir="${file%/*}"
    mkdir -p ".${truncdir}"
    mkfifo ".${file}"
    tail -c "$MAX_FILE_SIZE" "$file" >".${file}" &

Comment on lines +321 to +322

    MAX_FILE_SIZE=$((10 * 1024 * 1024))
    echo "Adding log files to zip archive with max file size: $MAX_FILE_SIZE bytes..."

Comment on lines 110 to +112

    # force a log upload to the host after the provisioning script finishes
    # if we failed, wait for the upload to complete so that we don't remove
    # the VM before it finishes. if we succeeded, upload in the background
    # so that the provisioning script returns success more quickly
    # the VM before it finishes.

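
The behavior the revised comment describes, uploading only when provisioning failed and waiting for that upload to finish, can be sketched as follows. The body of `upload_logs` here is a stand-in and `finish_provisioning` is a hypothetical wrapper; neither is taken from `cse_start.sh`.

```shell
#!/bin/sh
# Stand-in body for illustration; the real upload routine lives in the repo.
upload_logs() {
    echo "uploading logs"
}

# Hypothetical wrapper: upload in the foreground on failure so the VM is not
# removed before the upload finishes; skip the upload entirely on success.
finish_provisioning() {
    exit_code=$1
    if [ "$exit_code" -ne 0 ]; then
        upload_logs
    fi
    return "$exit_code"
}
```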
Comment on lines 4 to 6

    [Timer]
    OnActiveSec=0m
    OnBootSec=5min
    OnActiveSec=10min
    OnUnitActiveSec=60m

Comment on lines 205 to +210

    # Collect general information and create the ZIP in the first place
    zip -DZ deflate "${ZIP}" /proc/@(cmdline|cpuinfo|filesystems|interrupts|loadavg|meminfo|modules|mounts|slabinfo|stat|uptime|version*|vmstat) /proc/net/*

    # Include some disk listings
    collectToZip collect/file_listings.txt find /dev /etc /var/lib/waagent /var/log -ls

    # Collect system information
    collectToZip collect/blkid.txt blkid $(find /dev -type b ! -name 'sr*')
    collectToZip collect/du_bytes.txt df -al
    collectToZip collect/du_inodes.txt df -ail
    collectToZip collect/diskinfo.txt lsblk
    collectToZip collect/lscpu.txt lscpu
    collectToZip collect/lscpu.json lscpu -J
    collectToZip collect/lsipc.txt lsipc
    collectToZip collect/lsns.json lsns -J --output-all
    collectToZip collect/lspci.txt lspci -vkPP
    collectToZip collect/lsscsi.txt lsscsi -vv
    collectToZip collect/lsvmbus.txt lsvmbus -vv
    collectToZip collect/sysctl.txt sysctl -a
    collectToZip collect/systemctl-status.txt systemctl status --all -fr
    zip -DZ deflate "${ZIP}" /proc/@(cmdline|loadavg|meminfo|mounts|uptime|version*)

    if [ "$COLLECT_SYSINFO" = "true" ]; then
        # Extensive proc info
        zip -gDZ deflate "${ZIP}" /proc/@(cpuinfo|filesystems|interrupts|modules|slabinfo|stat|vmstat) /proc/net/*

Summary of Changes

1. aks-log-collector.sh: Reduce node impact by splitting normal/full collection

Disk I/O reduction:
- Added "waagent_full" config flag (default false) that gates the heavy WALinuxAgent-style manifest (rotated syslogs, journal, kern, dpkg, boot, secure, etc.)
- Default (normal) mode only collects: AKS configs, GPU logs, current syslog, auth.log, dmesg, cloud-init, azure extension logs
- Files >1MB are tail-truncated via fifo, which avoids reading multi-GB logs
- .gz (already-compressed rotated logs) are skipped entirely
- Truncated files preserve their original path in the zip

Disruptive commands gated behind sysinfo flag:
- conntrack -L / conntrack -S (locks conntrack spinlock)
- ss -anoempiO --cgroup (locks socket tables)
- systemctl status --all -fr (queries all units)
- find /dev /etc /var/lib/waagent /var/log -ls (heavy I/O)
- crictl stats / crictl statsp (can stall containerd)
- blkid, sysctl -a, lspci -vkPP, etc.

Per-netns disruptive calls (conntrack, ss) also gated behind sysinfo within the COLLECT_NETNS block.

2. aks-log-collector.timer: Delay first run, simplify triggers
- Removed OnActiveSec=0m (no longer runs immediately on timer activation)
- Changed OnBootSec=5min → OnBootSec=10min (gives node more time to stabilize before first collection)
- Keeps OnUnitActiveSec=60m (hourly thereafter)

3. cse_main.sh: Start log collector after kubelet is healthy
- Added systemctlEnableAndStartNoBlock aks-log-collector.timer 30 || true after the kubelet health check
- Ensures log collection only begins once the node is provisioned and kubelet is running

4. cse_start.sh: Remove background log upload on success
- Removed upload_logs & on successful CSE exit
- Logs are still uploaded on CSE failure (EXIT_CODE != 0)
- Reduces unnecessary background I/O on the happy path

5. pre-install-dependencies.sh: Don't start timer during VHD build
- Changed from systemctlEnableAndStart aks-log-collector.timer to systemctl disable --now aks-log-collector.service
- Timer is no longer started during VHD build (it gets enabled at provisioning time via CSE instead)
- Still disables WALA log collection via waagent.conf

Net effect

By default (all flags false), the log collector now:
- Waits 10 minutes after boot
- Only collects lightweight AKS-specific configs + last 1MB of key logs
- Skips all expensive system introspection commands
- Doesn't read rotated/compressed logs
- Runs hourly with minimal disk and kernel impact

Full diagnostics are available on-demand via IMDS tags (sysinfo, waagent_full, netns, iptables, nftables).
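
The flag gating described in the summary can be illustrated with a small wrapper. This is a sketch, not the PR's code: `run_if_enabled` is a hypothetical helper (the script itself uses inline `if [ "$COLLECT_SYSINFO" = "true" ]` checks), and the default of `false` mirrors the "all flags false" default stated above.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command only when its gating flag is "true",
# otherwise report that it was skipped.
run_if_enabled() {
    flag=$1
    shift
    if [ "$flag" = "true" ]; then
        "$@"
    else
        echo "skipping: $*"
    fi
}

# Flags default to "false", so expensive introspection is opt-in.
COLLECT_SYSINFO=${COLLECT_SYSINFO:-false}
run_if_enabled "$COLLECT_SYSINFO" sysctl -a
```

Keeping the default in one place makes it harder for a newly added command to bypass the gate by accident.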
cameronmeissner approved these changes on May 27, 2026