feat: reduce node impact by aks-log-collector#8598

Merged: awesomenix merged 1 commit into main from nishp/update/kube on May 27, 2026

@awesomenix
Contributor

Summary of Changes

  1. aks-log-collector.sh — Reduce node impact by splitting normal/full collection

Disk I/O reduction:

  • Added "waagent_full" config flag (default false) that gates the heavy WALinuxAgent-style manifest (rotated syslogs, journal, kern, dpkg, boot, secure, etc.)
  • Default (normal) mode only collects: AKS configs, GPU logs, current syslog, auth.log, dmesg, cloud-init, azure extension logs
  • Files >1MB are tail-truncated via fifo — avoids reading multi-GB logs
  • .gz (already-compressed rotated logs) are skipped entirely
  • Truncated files preserve their original path in the zip
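The skip/truncate rules above can be sketched as a small standalone function. This is only an illustration, not the script's actual code: `stage_file` and the staging directory are hypothetical names, and `cat` stands in for the archiver that reads the fifo (the reviewed hunk uses `zip --fifo` for that step).

```shell
#!/usr/bin/env bash
# Sketch only: stage_file and the staging layout are illustrative, not from the PR.
MAX_FILE_SIZE=${MAX_FILE_SIZE:-$((1024 * 1024))}  # 1 MiB, as stated in the summary

# stage_file SRC DEST_ROOT
# Places SRC (an absolute path) under DEST_ROOT at its original path.
# *.gz files are skipped; oversized files keep only their last MAX_FILE_SIZE bytes.
stage_file() {
    local src=$1 dest_root=$2 size dest fifo tail_pid
    case "$src" in
        *.gz) echo "skip $src"; return 0 ;;
    esac
    size=$(wc -c <"$src") || return 1
    dest="${dest_root}${src}"
    mkdir -p "${dest%/*}"
    if [ "$size" -gt "$MAX_FILE_SIZE" ]; then
        fifo="${dest}.fifo"
        mkfifo "$fifo"
        # tail reads only the end of the file and streams it through the fifo
        tail -c "$MAX_FILE_SIZE" "$src" >"$fifo" &
        tail_pid=$!
        cat "$fifo" >"$dest"   # stand-in for the archiver consuming the fifo
        wait "$tail_pid"
        rm -f "$fifo"
        echo "truncate $src"
    else
        cp "$src" "$dest"
        echo "copy $src"
    fi
}
```

For example, `stage_file /var/log/syslog /tmp/staging` would leave at most the last 1 MiB of the file at `/tmp/staging/var/log/syslog`.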

Disruptive commands gated behind sysinfo flag:

  • conntrack -L / conntrack -S (locks conntrack spinlock)
  • ss -anoempiO --cgroup (locks socket tables)
  • systemctl status --all -fr (queries all units)
  • find /dev /etc /var/lib/waagent /var/log -ls (heavy I/O)
  • crictl stats / crictl statsp (can stall containerd)
  • blkid, sysctl -a, lspci -vkPP, etc.

Per-netns disruptive calls (conntrack, ss) also gated behind sysinfo within the COLLECT_NETNS block.
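A rough sketch of the gating pattern, assuming a flag variable that holds the literal string `true` or `false`. `run_if_enabled` is a hypothetical helper written for this illustration; the script itself wraps its `collectToZip` calls in an `if [ "$COLLECT_SYSINFO" = "true" ]` block.

```shell
#!/usr/bin/env bash
# Sketch only: run_if_enabled is a hypothetical helper, not part of the PR.
COLLECT_SYSINFO=${COLLECT_SYSINFO:-false}  # default off, as described above

# run_if_enabled FLAG_VALUE OUTPUT_FILE COMMAND [ARGS...]
# Runs COMMAND and captures its output only when FLAG_VALUE is "true";
# otherwise the command is never executed, so it cannot take any locks.
run_if_enabled() {
    local enabled=$1 out=$2
    shift 2
    if [ "$enabled" != "true" ]; then
        echo "skipped: $*"
        return 0
    fi
    "$@" >"$out" 2>&1
}
```

A call such as `run_if_enabled "$COLLECT_SYSINFO" conntrack.txt conntrack -L` then does nothing on a default node and only runs `conntrack -L` when the flag has been turned on.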


  2. aks-log-collector.timer — Delay first run, simplify triggers
  • Removed OnActiveSec=0m (no longer runs immediately on timer activation)
  • Changed OnBootSec=5min → OnBootSec=10min (gives node more time to stabilize before first collection)
  • Keeps OnUnitActiveSec=60m (hourly thereafter)
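As a rough illustration, the `[Timer]` section described by these bullets would look like the output of the function below. It uses only the values listed in this summary; the remaining sections of the unit file are not shown in this conversation and are left out, and `render_timer_section` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: prints the [Timer] section as the summary describes it.
# The rest of the unit file is not reproduced here.
render_timer_section() {
    cat <<'EOF'
[Timer]
OnBootSec=10min
OnUnitActiveSec=60m
EOF
}
```

On a running node, `systemctl list-timers aks-log-collector.timer` shows when the timer will next elapse.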

  3. cse_main.sh — Start log collector after kubelet is healthy
  • Added systemctlEnableAndStartNoBlock aks-log-collector.timer 30 || true after the kubelet health check
  • Ensures log collection only begins once the node is provisioned and kubelet is running
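The ordering and the non-fatal start can be sketched as follows. The function and its two arguments are hypothetical stand-ins (the real script calls its own health check and `systemctlEnableAndStartNoBlock`); the point is only that a failed timer start prints a warning rather than failing provisioning.

```shell
#!/usr/bin/env bash
# Sketch only: start_collector_when_healthy and its arguments are illustrative.

# start_collector_when_healthy HEALTH_CHECK_CMD START_CMD
# Starts the collector only after the health check passes. A failure to
# start the timer is logged but does not change the return code.
start_collector_when_healthy() {
    local health_check=$1 start_cmd=$2
    if ! "$health_check"; then
        echo "kubelet not healthy; not starting log collector"
        return 1
    fi
    "$start_cmd" || echo "Warning: Could not start aks-log-collector.timer"
    return 0
}
```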

  4. cse_start.sh — Remove background log upload on success
  • Removed upload_logs & on successful CSE exit
  • Logs are still uploaded on CSE failure (EXIT_CODE != 0)
  • Reduces unnecessary background I/O on the happy path
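A minimal sketch of the resulting control flow, assuming the exit code is available as a plain integer. `finish_provisioning` and its upload argument are hypothetical names for illustration; the script's own function is `upload_logs`.

```shell
#!/usr/bin/env bash
# Sketch only: finish_provisioning is illustrative, not the script's code.

# finish_provisioning EXIT_CODE UPLOAD_CMD
# Uploads logs only when provisioning failed, and does so in the
# foreground so the upload finishes before the caller returns.
finish_provisioning() {
    local exit_code=$1 upload_cmd=$2
    if [ "$exit_code" -ne 0 ]; then
        "$upload_cmd"
    fi
    return "$exit_code"
}
```

On the success path nothing is spawned, so no background upload competes with workloads that start right after provisioning.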

  5. pre-install-dependencies.sh — Don't start timer during VHD build
  • Changed from systemctlEnableAndStart aks-log-collector.timer to systemctl disable --now aks-log-collector.service
  • Timer is no longer started during VHD build (it gets enabled at provisioning time via CSE instead)
  • Still disables WALA log collection via waagent.conf

Net effect

By default (all flags false), the log collector now:

  • Waits 10 minutes after boot
  • Only collects lightweight AKS-specific configs + last 1MB of key logs
  • Skips all expensive system introspection commands
  • Doesn't read rotated/compressed logs
  • Runs hourly with minimal disk and kernel impact

Full diagnostics are available on-demand via IMDS tags (sysinfo, waagent_full, netns, iptables, nftables).

Contributor

Copilot AI left a comment

Pull request overview

This PR reduces node impact from aks-log-collector by (1) splitting “normal” vs “full” log collection, (2) gating expensive system introspection behind IMDS flags, and (3) adjusting when/why collection is triggered across VHD build vs node provisioning.

Changes:

  • Make aks-log-collector.sh lighter by default (optional waagent_full + sysinfo flags; tail-truncate files >1MB; skip .gz).
  • Delay and simplify timer behavior, and start the timer only after kubelet is healthy during provisioning.
  • Remove background log upload on successful CSE completion (keep upload on failure).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| vhdbuilder/packer/pre-install-dependencies.sh | Stops enabling the log collector during VHD build; adjusts systemd handling for the collector unit. |
| parts/linux/cloud-init/artifacts/cse_start.sh | Removes success-path background log upload; retains upload on failure. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Enables/starts the log-collector timer after kubelet health check. |
| parts/linux/cloud-init/artifacts/aks-log-collector.timer | Changes initial scheduling (now delayed) and simplifies timer triggers. |
| parts/linux/cloud-init/artifacts/aks-log-collector.sh | Introduces normal vs full collection, sysinfo gating, and tail-truncation/skip logic to reduce IO/lock impact. |

Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.timer
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/cse_start.sh
Comment thread vhdbuilder/packer/pre-install-dependencies.sh Outdated

# Based on MANIFEST_FULL from Azure Linux Agent's log collector
# https://github.com/Azure/WALinuxAgent/blob/master/azurelinuxagent/ga/logcollector_manifests.py
if [ "$COLLECT_WAAGENT_FULL" = "true" ]; then
Contributor

nit: might be a good idea to convert the flags to lower-case
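One way to act on this nit, sketched under the assumption that flag values arrive as free-form strings such as `True` or `TRUE`. `normalize_flag` is a hypothetical helper, not something from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: normalize_flag is a hypothetical helper.

# normalize_flag VALUE
# Prints "true" when VALUE is any casing of "true", otherwise "false",
# so later checks can keep comparing against the literal string "true".
normalize_flag() {
    local value
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    if [ "$value" = "true" ]; then
        echo true
    else
        echo false
    fi
}
```

With this, `COLLECT_WAAGENT_FULL=$(normalize_flag "$COLLECT_WAAGENT_FULL")` would let the existing `[ "$COLLECT_WAAGENT_FULL" = "true" ]` test accept `True` and `TRUE` as well.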

Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.timer Outdated
Comment thread parts/linux/cloud-init/artifacts/cse_main.sh Outdated
@awesomenix awesomenix force-pushed the nishp/update/kube branch from c95385c to dbf2fb0 Compare May 27, 2026 19:11
Copilot AI review requested due to automatic review settings May 27, 2026 19:13
@awesomenix awesomenix force-pushed the nishp/update/kube branch from dbf2fb0 to 69fb41a Compare May 27, 2026 19:13
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Comment on lines +589 to +590
systemctlEnableAndStartNoBlock aks-log-collector.timer 30 || echo "Warning: Could not start aks-log-collector.timer"

Comment thread parts/linux/cloud-init/artifacts/cse_main.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh
Comment thread vhdbuilder/packer/pre-install-dependencies.sh
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.timer
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh
@awesomenix awesomenix force-pushed the nishp/update/kube branch from 69fb41a to e1b3ee1 Compare May 27, 2026 19:57
Copilot AI review requested due to automatic review settings May 27, 2026 20:01
@awesomenix awesomenix force-pushed the nishp/update/kube branch from e1b3ee1 to 31ff24f Compare May 27, 2026 20:01
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.timer
Comment thread parts/linux/cloud-init/artifacts/cse_start.sh
Comment on lines +328 to +347
fsize=$(stat --printf "%s" "$file")
if [ "$fsize" -gt "$MAX_FILE_SIZE" ]; then
    # Preserve directory structure so zip entry has the original path
    truncdir="${file%/*}"
    mkdir -p ".${truncdir}"
    mkfifo ".${file}"
    tail -c "$MAX_FILE_SIZE" "$file" >".${file}" &
    tail_pid=$!
    zip -gDZ deflate --fifo "${ZIP}" ".${file}"
    wait "$tail_pid" 2>/dev/null
    rm -f ".${file}"
else
    zip -g -DZ deflate -u "${ZIP}" "$file" -x '*.sock'
fi

FILE_SIZE=$(stat --printf "%s" "${ZIP}")
if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then
    echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE; removing last log file and terminating adding more files."
    zip -d "${ZIP}" $file
    break
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh
@awesomenix awesomenix force-pushed the nishp/update/kube branch from 31ff24f to 754bd58 Compare May 27, 2026 20:19
Comment thread parts/linux/cloud-init/artifacts/aks-log-collector.sh
Copilot AI review requested due to automatic review settings May 27, 2026 22:24
@awesomenix awesomenix force-pushed the nishp/update/kube branch from 754bd58 to ac25102 Compare May 27, 2026 22:24
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Comment on lines +343 to 347
FILE_SIZE=$(stat --printf "%s" "${ZIP}")
if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then
    echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE; stopping."
    break
fi
echo "Adding log files to zip archive with max file size: $MAX_FILE_SIZE bytes..."
for file in "${GLOBS[@]}"; do
# shellcheck disable=SC3010
[[ "$file" == *.gz ]] && continue
Comment on lines +331 to +334
truncdir="${file%/*}"
mkdir -p ".${truncdir}"
mkfifo ".${file}"
tail -c "$MAX_FILE_SIZE" "$file" >".${file}" &
Comment on lines +321 to +322
MAX_FILE_SIZE=$((10 * 1024 * 1024))
echo "Adding log files to zip archive with max file size: $MAX_FILE_SIZE bytes..."
Comment on lines 110 to +112
 # force a log upload to the host after the provisioning script finishes
 # if we failed, wait for the upload to complete so that we don't remove
-# the VM before it finishes. if we succeeded, upload in the background
-# so that the provisioning script returns success more quickly
+# the VM before it finishes.
Comment on lines 4 to 6
 [Timer]
-OnActiveSec=0m
-OnBootSec=5min
+OnActiveSec=10min
 OnUnitActiveSec=60m
Comment on lines 205 to +210
# Collect general information and create the ZIP in the first place
zip -DZ deflate "${ZIP}" /proc/@(cmdline|cpuinfo|filesystems|interrupts|loadavg|meminfo|modules|mounts|slabinfo|stat|uptime|version*|vmstat) /proc/net/*

# Include some disk listings
collectToZip collect/file_listings.txt find /dev /etc /var/lib/waagent /var/log -ls

# Collect system information
collectToZip collect/blkid.txt blkid $(find /dev -type b ! -name 'sr*')
collectToZip collect/du_bytes.txt df -al
collectToZip collect/du_inodes.txt df -ail
collectToZip collect/diskinfo.txt lsblk
collectToZip collect/lscpu.txt lscpu
collectToZip collect/lscpu.json lscpu -J
collectToZip collect/lsipc.txt lsipc
collectToZip collect/lsns.json lsns -J --output-all
collectToZip collect/lspci.txt lspci -vkPP
collectToZip collect/lsscsi.txt lsscsi -vv
collectToZip collect/lsvmbus.txt lsvmbus -vv
collectToZip collect/sysctl.txt sysctl -a
collectToZip collect/systemctl-status.txt systemctl status --all -fr
zip -DZ deflate "${ZIP}" /proc/@(cmdline|loadavg|meminfo|mounts|uptime|version*)

if [ "$COLLECT_SYSINFO" = "true" ]; then
# Extensive proc info
zip -gDZ deflate "${ZIP}" /proc/@(cpuinfo|filesystems|interrupts|modules|slabinfo|stat|vmstat) /proc/net/*
@awesomenix awesomenix force-pushed the nishp/update/kube branch from ac25102 to f015403 Compare May 27, 2026 22:33
@awesomenix awesomenix merged commit 0dfdff8 into main May 27, 2026
20 of 32 checks passed
@awesomenix awesomenix deleted the nishp/update/kube branch May 27, 2026 22:47
3 participants