
CI: add disk usage snapshot before BOLT cmake configuration (testing) #103908

Draft
leshikus wants to merge 1 commit into ClickHouse:master from leshikus:test-bolt-disk-usage-diag


leshikus commented May 2, 2026

Testing PR to diagnose the BOLT toolchain build failure from #103307.

Every cmake compiler-feature test invokes `clang-21.inst` (the BOLT-instrumented
clang), which writes a profile file per invocation. With hundreds of tests, this
may fill several GB of disk before the real build starts and kill the runner
silently. This PR adds `du / | sort -rn | head -100` and `df -h` right before
the BOLT cmake step to confirm whether disk exhaustion is the root cause.

The `build_toolchain.py` change also busts the toolchain cache so the
`Build Toolchain (PGO, BOLT) (amd64)` job actually runs in the PR workflow.
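
To make the hypothesis above measurable, a small helper could count the profile files and sum their sizes. This is only a sketch: `profile_footprint` is a hypothetical name, and the glob pattern is left to the caller because the PR does not say where `clang-21.inst` writes its profiles.

```python
import glob
import os
import stat


def profile_footprint(pattern):
    """Return (file_count, total_bytes) for regular files matching a glob.

    Hypothetical helper, not part of build_toolchain.py. The pattern must be
    supplied by the caller; the profile location is an unknown here.
    """
    count = 0
    total_bytes = 0
    for path in glob.iglob(pattern):
        try:
            info = os.lstat(path)
        except OSError:
            continue  # the file disappeared between listing and stat
        if stat.S_ISREG(info.st_mode):
            count += 1
            total_bytes += info.st_size  # apparent size, not allocated blocks
    return count, total_bytes
```

Calling it once before and once after the cmake configuration step would show whether the number of profile files grows with the number of compiler-feature tests.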

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Add `du / | sort -rn | head -100` and `df -h` right before the BOLT
profile collection CMake step to diagnose disk exhaustion. Every cmake
compiler-feature test invokes the BOLT-instrumented `clang-21.inst`,
which writes a profile file per invocation. With hundreds of tests this
may consume several GB before the real build starts and kill the runner
silently. The snapshot will confirm whether disk is the culprit.

Modifying this file also busts the toolchain cache so the
`Build Toolchain (PGO, BOLT)` job runs in the PR workflow.

clickhouse-gh Bot commented May 2, 2026

Workflow [PR], commit [ab260e7]

AI Review

Summary

This PR adds a pre-BOLT disk snapshot in `build_toolchain.py` to diagnose suspected runner disk exhaustion during CMake compiler-feature checks. Verdict: request changes. The new diagnostic command can itself worsen low-disk conditions and make the failing scenario less stable.

Missing context
  • ⚠️ No CI run logs/results were provided in this review context, so runtime impact is evaluated from code inspection only.
Findings
  • ❌ Blockers
    • [ci/jobs/build_toolchain.py:527] `du / 2>/dev/null | sort -rn | head -100` scans the entire root tree and performs an unbounded sort. In the exact low-disk condition this PR is trying to diagnose, `sort` can spill temporary files and increase disk pressure, potentially masking the root cause or causing additional failures.
      Suggested fix: constrain the scan scope before sorting, e.g. `du -x --max-depth=3 / 2>/dev/null | sort -rn | head -100 || true`.
ClickHouse Rules

| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | |
| No magic constants | | |
| Backward compatibility | | |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | | |
| Safe rollout | ⚠️ | Diagnostic command is not safe under disk-pressure conditions. |
| Compilation time | | |
| No large/binary files | | |

Performance & Safety
  • The added command introduces high I/O and potentially high temporary-disk usage at a fragile point in the job; this is a safety/performance regression for the failure mode under investigation.
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    1. Replace the unbounded root scan/sort command with a bounded, single-filesystem disk snapshot command before merge.

clickhouse-gh Bot added the pr-ci label May 2, 2026

# consume several GB before the real build starts.
if bolt_ok:
    print("=== Disk usage before BOLT cmake configuration ===")
    Shell.check("du / 2>/dev/null | sort -rn | head -100 || true")

`du / 2>/dev/null | sort -rn | head -100` scans the whole root tree and then sorts potentially millions of lines. On a nearly full runner (the exact case this PR is diagnosing), `sort` can spill temporary files to disk and make disk pressure worse, and the trailing `|| true` hides any failure of the pipeline.

Please bound this snapshot to avoid self-inflicted pressure, e.g. use `du -x --max-depth=<N> /` (single filesystem + limited depth) before sorting.
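
One way to take such a bounded snapshot without any external `sort` is to do the walk in Python and keep only the largest entries in a fixed-size heap, so nothing is spilled to disk. The sketch below is an approximation of `du -x --max-depth=<N>`, not the code in this PR: `bounded_disk_snapshot` and its parameters are hypothetical, it counts allocated blocks of files only (not directory entries), it counts hard-linked files once per link, and it relies on `st_blocks`, which is available on Unix.

```python
import heapq
import os
import shutil


def bounded_disk_snapshot(root="/", max_depth=3, top_n=100):
    """Return (free_bytes, [(size_bytes, path), ...]) sorted largest first.

    Hypothetical helper, not part of build_toolchain.py.
    """
    root_dev = os.stat(root).st_dev
    heap = []  # min-heap that never holds more than top_n entries

    def walk(path, depth):
        total = 0
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    try:
                        info = entry.stat(follow_symlinks=False)
                    except OSError:
                        continue  # unreadable or vanished entry
                    if info.st_dev != root_dev:
                        continue  # stay on one filesystem, like du -x
                    if entry.is_dir(follow_symlinks=False):
                        total += walk(entry.path, depth + 1)
                    else:
                        total += info.st_blocks * 512  # allocated bytes
        except OSError:
            pass  # directory could not be listed; report what was counted
        if depth <= max_depth:
            item = (total, path)
            if len(heap) < top_n:
                heapq.heappush(heap, item)
            else:
                heapq.heappushpop(heap, item)  # drop the smallest entry
        return total

    walk(root, 0)
    free_bytes = shutil.disk_usage(root).free
    return free_bytes, sorted(heap, reverse=True)
```

The full tree is still traversed, as with `du`, so the I/O cost remains; what is bounded is the memory and temporary-disk footprint of ranking the results.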
