feat: add windows-log-analysis Copilot skill (LLM sub-skills) #8214
timmy-wright wants to merge 8 commits into main
Pull request overview
Adds a new GitHub Copilot skill under .github/skills/windows-log-analysis/ to help diagnose Windows AKS node issues from log bundles produced by staging/cse/windows/debug/collect-windows-logs.ps1, including an accompanying Python analyzer script.
Changes:
- Introduces `SKILL.md` with a log-bundle reference guide and troubleshooting playbooks.
- Adds `analyze-windows-logs.py` to scan multi-snapshot bundles, trend key metrics, and emit prioritized findings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| .github/skills/windows-log-analysis/SKILL.md | Skill definition and reference guide for interpreting collected Windows node logs |
| .github/skills/windows-log-analysis/analyze-windows-logs.py | Python 3 analyzer for automated triage of collected Windows log bundles |
b63d0ec to b6818e8
| `kubelet.log` | UTF-8 | Kubelet stdout logs (if present) |
| `kubelet.err.log` | UTF-8 | Kubelet stderr logs (if present) |
| `<ts>-cri-containerd-pods.txt` | UTF-16-LE with BOM | `crictl pods`: cross-reference pod state |
| `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |
*_services.csv is exported by collect-windows-logs.ps1 via Export-Csv without -Encoding, which defaults to UTF-16LE on Windows PowerShell. Marking it as UTF-8 here will cause parsers to mis-decode the file; update the encoding (and ideally the pattern to <ts>_services.csv for consistency with other entries).
Suggested change:
- | `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |
+ | `<ts>_services.csv` | UTF-16-LE with BOM | Service status timeline used for kubelet crash/restart and clock skew checks |
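Since several comments in this review turn on which encoding a given file actually has, a reader that inspects the byte-order mark instead of trusting the documented encoding may be more robust. The following is a minimal sketch, not the PR's analyzer: `sniff_encoding` and `read_log_text` are hypothetical helper names, and the UTF-8 fallback for BOM-less files is an assumption (a UTF-16 file written without a BOM would still be mis-decoded).

```python
import codecs
from pathlib import Path

# Longer BOMs come first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on the BOM, or `default` if no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default  # assumption: BOM-less files are treated as UTF-8


def read_log_text(path) -> str:
    """Read a bundle file as text, decoding according to its BOM (hypothetical helper)."""
    raw = Path(path).read_bytes()
    # The "utf-16", "utf-32", and "utf-8-sig" codecs consume the BOM while decoding.
    return raw.decode(sniff_encoding(raw), errors="replace")
```

With this approach the encoding column in the reference tables acts as a hint for the agent rather than a hard requirement for the parser.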
| `bootstrap-config` | analyze-bootstrap |
| `*-ccg-*.evtx` or CCG event logs | analyze-gmsa |
| `gmsa-*.log` or gMSA credential spec files | analyze-gmsa |
| `kubectl-describe-nodes.log` | analyze-gpu, analyze-kubelet |
File-dispatch mapping seems incomplete: kubectl-describe-nodes.log is consumed by analyze-kubelet.md (node conditions/taints/events) as well as GPU analysis, but the table only routes it to analyze-gpu. Add analyze-kubelet here to avoid skipping kubelet triage when this file is present.
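The routing this comment describes can be sketched as a small pattern table. This is only an illustration of the dispatch idea under stated assumptions: in the PR the sub-skills are markdown instructions for an agent rather than callable code, the `DISPATCH` list and `select_sub_skills` function are hypothetical names, and only the four rows quoted above are included.

```python
from fnmatch import fnmatchcase

# Pattern -> sub-skills, taken from the four dispatch rows quoted above.
# A file may route to more than one sub-skill.
DISPATCH = [
    ("bootstrap-config", ["analyze-bootstrap"]),
    ("*-ccg-*.evtx", ["analyze-gmsa"]),
    ("gmsa-*.log", ["analyze-gmsa"]),
    ("kubectl-describe-nodes.log", ["analyze-gpu", "analyze-kubelet"]),
]


def select_sub_skills(filenames):
    """Return the de-duplicated sub-skills to run, in first-match order."""
    selected = []
    for name in filenames:
        for pattern, skills in DISPATCH:
            # Lower-case the name so matching is case-insensitive on every OS.
            if fnmatchcase(name.lower(), pattern):
                for skill in skills:
                    if skill not in selected:
                        selected.append(skill)
    return selected
```

Keeping a list of skills per pattern makes the omission flagged here easy to spot: a file consumed by two sub-skills simply lists both.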
…e.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
**WINDOWS_CSE_ERROR codes** (from AgentBaker `windowscsehelper.ps1`):

| Code | Name | Meaning |
|------|------|---------|
| 0 | SUCCESS | CSE completed successfully |
| 1 | UNKNOWN | Unexpected error in catch block |
| 2 | DOWNLOAD_FILE_WITH_RETRY | File download failed after retries |
| 3 | INVOKE_EXECUTABLE | Executable invocation failed |
| 4 | FILE_NOT_EXIST | Required file missing |
| 5 | CHECK_API_SERVER_CONNECTIVITY | Cannot reach API server |
| 6 | PAUSE_IMAGE_NOT_EXIST | Pause container image missing |
| 7 | GET_SUBNET_PREFIX | Failed to get subnet prefix |
| 8 | GENERATE_TOKEN_FOR_ARM | ARM token generation failed |
| 9 | NETWORK_INTERFACES_NOT_EXIST | No network interfaces found |
| 10 | NETWORK_ADAPTER_NOT_EXIST | Network adapter missing |
| 11 | MANAGEMENT_IP_NOT_EXIST | Management IP not found |
| 12 | CALICO_SERVICE_ACCOUNT_NOT_EXIST | Calico SA missing |
| 13 | CONTAINERD_NOT_INSTALLED | containerd binary not found |
| 14 | CONTAINERD_NOT_RUNNING | containerd service not running |
| 15 | OPENSSH_NOT_INSTALLED | OpenSSH not installed |
| 16 | OPENSSH_FIREWALL_NOT_CONFIGURED | OpenSSH firewall rule missing |
| 17 | INVALID_PARAMETER_IN_AZURE_CONFIG | Bad azure.json parameter |
| 19 | GET_CA_CERTIFICATES | CA cert retrieval failed |
| 20 | DOWNLOAD_CA_CERTIFICATES | CA cert download failed |
| 21 | EMPTY_CA_CERTIFICATES | CA certs empty |
| 22 | ENABLE_SECURE_TLS | Secure TLS enablement failed |
| 23–28 | GMSA_* | gMSA setup failures |
| 29 | NOT_FOUND_MANAGEMENT_IP | Management IP lookup failed |
| 30 | NOT_FOUND_BUILD_NUMBER | Windows build number not found |
| 31 | NOT_FOUND_PROVISIONING_SCRIPTS | Provisioning scripts missing |
| 32 | START_NODE_RESET_SCRIPT_TASK | Node reset task failed to start |
| 33–40 | DOWNLOAD_*_PACKAGE | Package download failures (CSE, K8s, CNI, HNS, Calico, gMSA, CSI proxy, containerd) |
| 41 | SET_TCP_DYNAMIC_PORT_RANGE | TCP port range configuration failed |
| 43 | PULL_PAUSE_IMAGE | Pause image pull failed |
| 45 | CONTAINERD_BINARY_EXIST | containerd binary check failed |
| 46–48 | SET_*_PORT_RANGE | Port range exclusion failures |
| 49 | NO_CUSTOM_DATA_BIN | CustomData.bin missing (very early failure) |
| 50 | NO_CSE_RESULT_LOG | CSE did not produce result log |
| 52 | RESIZE_OS_DRIVE | OS drive resize failed |
| 53–61 | GPU_* | GPU driver installation failures |
| 62 | UPDATING_KUBE_CLUSTER_CONFIG | Kube cluster config update failed |
| 64 | GET_CONTAINERD_VERSION | containerd version detection failed |
| 65–67 | CREDENTIAL_PROVIDER_* | Credential provider install/config failures |
| 68 | ADJUST_PAGEFILE_SIZE | Pagefile resize failed |
| 70–71 | SECURE_TLS_BOOTSTRAP_* | Secure TLS bootstrap client failures |
| 72 | CILIUM_NETWORKING_INSTALL_FAILED | Cilium install failed |
| 73 | EXTRACT_ZIP | Zip extraction failed |
| 74–75 | LOAD/PARSE_METADATA | Metadata failures |
| 76–83 | ORAS_* | Network-isolated cluster artifact pull failures |
This section says the table is sourced from windowscsehelper.ps1 and later references a “full code table”, but the table omits several defined codes (e.g., 18, 42, 44, 51, 63, 69). To avoid misdiagnosis, either (a) include the missing codes/ranges, or (b) label this as a partial list of common codes and link readers to parts/windows/windowscsehelper.ps1 for the authoritative set.
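One way to act on option (b) is to make any lookup built from this table report gaps explicitly instead of guessing. The sketch below is illustrative only: `SINGLE_CODES` and `CODE_RANGES` hold a handful of entries copied from the table above (not the full set), and `describe_cse_code` is a hypothetical helper name.

```python
# A few entries copied from the table above; deliberately partial.
SINGLE_CODES = {
    0: "SUCCESS",
    1: "UNKNOWN",
    5: "CHECK_API_SERVER_CONNECTIVITY",
    49: "NO_CUSTOM_DATA_BIN",
    50: "NO_CSE_RESULT_LOG",
}

# Inclusive (low, high, family) ranges, also copied from the table.
CODE_RANGES = [
    (23, 28, "GMSA_*"),
    (53, 61, "GPU_*"),
    (76, 83, "ORAS_*"),
]


def describe_cse_code(code: int) -> str:
    """Map a WINDOWS_CSE_ERROR code to a name, flagging codes the table omits."""
    if code in SINGLE_CODES:
        return SINGLE_CODES[code]
    for low, high, family in CODE_RANGES:
        if low <= code <= high:
            return family
    # Codes such as 18, 42, or 44 land here: defer to the authoritative source.
    return f"code {code} not in this partial table; see windowscsehelper.ps1"
```

For example, `describe_cse_code(18)` returns the "not in this partial table" message rather than a wrong neighbouring name, which is the misdiagnosis risk the comment raises.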
- Fix HCS error code 0xC0370103/0x8037011F mapping: separate into two rows with clarifying note on HRESULT/NTSTATUS pairing uncertainty
- Add HNS Error Codes section to common-reference.md (0x1392, 0x490, 0x57, 0x5) with note that no official HNS error reference exists
- Add single vs. multi-snapshot guidance to common-reference.md
- Add wcifs.sys kernel file handle leak pattern to analyze-hcs.md
- Document HCSSHIM_TIMEOUT_* env vars in analyze-hcs.md
- Add deployment context caveat to HCS container churn threshold
- Add prominent Windows-specific note to analyze-kubelet.md: kubelet eviction is NOT implemented on Windows; DiskPressure/MemoryPressure won't auto-evict pods
- Add containerfs.inodesFree log spam as known noise (k8s#130142)
- Add note: no confirmed kubelet auto-restart watchdog on Windows
- Add Windows CRI named pipe path to analyze-kubelet.md
- Add container log rotation broken on Windows (containerd#7075) to analyze-containers.md and analyze-disk.md
- Add CNI logs in System32 warning to analyze-hns.md (containerd#4928)
- Note no official HNS error code reference in analyze-hns.md
- Add explicit note to analyze-gmsa.md: kubelet logs will NOT contain gMSA/Kerberos errors, only CCG evtx logs will
- Add explicit 3-skill quick triage path to SKILL.md
- Reference save-markdown-to-disk skill in SKILL.md for report output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ons.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…e.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| File Pattern | Encoding | Contents |
|-------------|----------|----------|
| `kubectl-describe-nodes.log` | UTF-8 | `kubectl describe node` output |
| `<ts>-aks-info.log` | UTF-16-LE with BOM | `kubectl describe node` + node YAML (allocatable, capacity, conditions) |
| `kubelet.log` | UTF-8 | Kubelet stdout logs (if present) |
kubectl-describe-nodes.log is marked as UTF-8, but in collect-windows-logs.ps1 it is produced via PowerShell redirection (kubectl ... > file), which writes UTF-16LE (“Unicode”) by default on Windows PowerShell 5.1. Please update the expected encoding (or note that it may be UTF-16LE) so the analysis doesn’t mis-decode the file.
| `<ts>-hnsdiag-list.txt` | analyze-hns, analyze-kubeproxy |
| `<ts>-aks-info.log` | analyze-bootstrap, analyze-memory, analyze-gpu |
| `<ts>-containerd-info.txt` | analyze-hcs |
| `<ts>-containerd-toml.txt` | analyze-hcs, analyze-images |
This skill references <ts>-aks-info.log, but collect-windows-logs.ps1 does not generate any *-aks-info.log file (it generates kubectl-describe-nodes.log / kubectl-get-nodes.log instead). Either update the collector to emit this file, or update the skill docs/file discovery table to use the actual bundle filenames.
| `<ts>_services.csv` | UTF-16-LE with BOM, CSV with embedded newlines | Service Control Manager event log |
| `silconfig.log` | UTF-16-LE with BOM or UTF-8 | Software Inventory Logging configuration |
| `processes.txt` | UTF-16-LE with BOM | Running processes with PIDs |
| `kubectl-get-nodes.log` | UTF-8 | `kubectl get nodes -o wide` output |
kubectl-get-nodes.log is marked as UTF-8, but it is produced by collect-windows-logs.ps1 via PowerShell redirection (kubectl ... > file), which writes UTF-16LE by default on Windows PowerShell 5.1. Please update the encoding guidance (or mention it may be UTF-16LE).
Suggested change:
- | `kubectl-get-nodes.log` | UTF-8 | `kubectl get nodes -o wide` output |
+ | `kubectl-get-nodes.log` | UTF-16-LE with BOM or UTF-8 | `kubectl get nodes -o wide` output; often UTF-16-LE when collected via Windows PowerShell 5.1 redirection (`>`) |
| `available-memory.txt` | UTF-16-LE with BOM | Available physical RAM at collection time |
| `processes.txt` | UTF-16-LE with BOM | `Get-Process` snapshot: per-process memory usage |
| `<ts>_pagefile.txt` | UTF-16-LE with BOM | Pagefile configuration and usage (size, auto-managed, peak) |
| `<ts>_services.csv` | UTF-16-LE with BOM, CSV with embedded newlines | Event ID 2004 = low memory condition |
| `<ts>-aks-info.log` | UTF-16-LE with BOM | Node YAML with allocatable memory |
This references <ts>-aks-info.log, but the log bundle produced by collect-windows-logs.ps1 doesn’t include such a file. Consider switching this input to kubectl-describe-nodes.log (which the collector does generate) or documenting how <ts>-aks-info.log is expected to be produced.
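The `<ts>_services.csv` row in this table is described as a CSV with embedded newlines, so line-by-line scanning for Event ID 2004 would split records. A hedged sketch of reading it with Python's `csv` module follows; the `EventID` column name is an assumption (the actual header depends on what the collector exports), `count_event_id` is a hypothetical helper, and the input is assumed to be text already decoded from the file's real encoding.

```python
import csv
import io


def count_event_id(csv_text: str, event_id: int, id_column: str = "EventID") -> int:
    """Count rows whose `id_column` equals `event_id` in already-decoded CSV text."""
    text = csv_text.lstrip("\ufeff")  # tolerate a leftover BOM character
    # Windows PowerShell's Export-Csv writes a "#TYPE ..." first line unless
    # -NoTypeInformation is passed; drop it so the real header row is used.
    if text.startswith("#TYPE"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
    # newline="" leaves line endings untouched, so quoted fields that contain
    # embedded newlines are reassembled into a single record by the csv module.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    wanted = str(event_id)
    return sum(1 for row in reader if (row.get(id_column) or "").strip() == wanted)
```

Passing the column name as a parameter keeps the sketch usable if the export names the field differently (for example `Id`).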
|-------------|----------|----------|
| `windowsnodereset.log` | UTF-8 or UTF-16-LE with BOM | Node reset/reimage flow log: full provisioning timeline |
| `bootstrap-config` | UTF-8 or UTF-16-LE with BOM | Bootstrap parameters passed to CSE (JSON or key-value) |
| `<ts>-aks-info.log` | UTF-16-LE with BOM | `kubectl describe node` + node YAML, component versions |
This references <ts>-aks-info.log, but the Windows bundle collector in this repo (collect-windows-logs.ps1) does not generate it. Either add it to the collector, or update this skill to rely on the existing kubectl-describe-nodes.log/kubectl-get-nodes.log outputs to extract versions and node YAML.
Suggested change:
- | `<ts>-aks-info.log` | UTF-16-LE with BOM | `kubectl describe node` + node YAML, component versions |
+ | `kubectl-describe-nodes.log` | UTF-8 or UTF-16-LE with BOM | `kubectl describe node` output for node conditions, taints, addresses, and event history |
+ | `kubectl-get-nodes.log` | UTF-8 or UTF-16-LE with BOM | `kubectl get nodes` output (including wide/YAML forms when present) for Kubernetes versions and node object details |
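If both naming schemes need to be tolerated while the collector and the skill docs are reconciled, the input selection could fall back through candidates in order. This is a sketch under assumptions, not behaviour the PR defines: the preference order in `NODE_INFO_CANDIDATES` and the `pick_node_info_file` helper are hypothetical, and picking the last name in sorted order assumes the `<ts>` prefix sorts chronologically.

```python
from fnmatch import fnmatchcase

# Hypothetical preference order: the timestamped file if a bundle has one,
# otherwise the files the collector is known to emit.
NODE_INFO_CANDIDATES = [
    "*-aks-info.log",
    "kubectl-describe-nodes.log",
    "kubectl-get-nodes.log",
]


def pick_node_info_file(filenames):
    """Return the best available node-info file name, or None if none is present."""
    for pattern in NODE_INFO_CANDIDATES:
        matches = sorted(n for n in filenames if fnmatchcase(n.lower(), pattern))
        if matches:
            # Assumption: a sortable timestamp prefix puts the newest snapshot last.
            return matches[-1]
    return None
```

Returning `None` lets the caller report that node details are unavailable instead of silently skipping the analysis.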
| File Pattern | Encoding | Contents |
|-------------|----------|----------|
| `*-nvidia-smi.txt` or `*nvidia-smi*` | UTF-8 or UTF-16-LE with BOM | `nvidia-smi` output: GPU inventory, utilization, temperature, errors |
| `kubectl-describe-nodes.log` | UTF-8 | `kubectl describe node`: resource capacity/allocatable including GPU |
kubectl-describe-nodes.log is marked as UTF-8, but it’s produced via PowerShell redirection in collect-windows-logs.ps1 (Windows PowerShell 5.1), which writes UTF-16LE by default. Update the encoding expectations so GPU analysis can actually parse the file in real bundles.
Suggested change:
- | `kubectl-describe-nodes.log` | UTF-8 | `kubectl describe node`: resource capacity/allocatable including GPU |
+ | `kubectl-describe-nodes.log` | UTF-16-LE with BOM | `kubectl describe node`: resource capacity/allocatable including GPU |
feat: add windows-log-analysis Copilot skill (LLM sub-skills)
Summary
Adds a Copilot CLI skill for diagnosing Windows AKS node issues from log bundles produced by `collect-windows-logs.ps1`. This skill uses LLM sub-skill markdown files that instruct AI agents how to analyze each log category.
It also adds a skill to save markdown to disk, because my agent kept having so many issues with this task.
Why LLM sub-skills instead of scripts?
Architecture
What the skill detects
Orchestrator features
- `common-reference.md` includes a dispatch table so agents pick the right 3-5 sub-skills instead of running all 16
- `SKILL.md` provides a full decision tree for combining findings across sub-skills
Key research that informed the sub-skills
- `Update-DefenderPreferences` missing path exclusions for `C:\ProgramData\containerd` and `containerd-shim-runhcs-v1.exe`
Files changed
- `.github/skills/windows-log-analysis/SKILL.md`: orchestrator with decision tree and root cause chains
- `.github/skills/windows-log-analysis/sub-skills/*.md`: 16 sub-skills + common reference (3,337 lines total)
- `.github/skills/windows-log-analysis/.gitignore`