Do not report test-runner abortions and per-test timeouts as Server died by leshikus · Pull Request #106506 · ClickHouse/ClickHouse

leshikus · 2026-06-04T22:32:36Z

Motivation

Follow-up to #105643, which folded every mid-flight abort exit code into a
single ABORTED_RUN_EXIT_CODES set and attached a synthetic Server died
leaf to all of them. That conflated unrelated conditions and produced
hundreds of spurious Server died rows from ordinary job-level timeouts and
flaky/targeted checks, where the server was healthy and shut down cleanly
(reported by @Algunenano; see e.g. the targeted run on #105019 and the flaky
check on #104965, and the alternative #106228).

Working through where each "timeout" actually originates showed there is no
single signal to fix - the cases are genuinely different and should be
reported differently.

Changes

tests/clickhouse-test — the per-test SIGALRM backstop (the watchdog for
completely frozen tests, armed at int(args.timeout * 1.1) + 60) used to
call stop_tests from inside the signal handler. That killpg SIGTERMs the
runner's own process group, so the run exits 143, which the job side then
misreports as an aborted run / Server died. The timed-out test is now
recorded as a normal per-test TestResult(FAIL, FailureReason.TIMEOUT)
through the usual process_result path - exactly like the existing
query-level and socket.timeout paths already do - so it surfaces as a plain
[ FAIL ] line with Reason: Timeout!. No exit-code-based reclassification
is needed on the job side. The test's client runs in its own session
(start_new_session=True), so it is still reaped by --cleanup, not by the
removed killpg.
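A minimal sketch of the idea, not the actual `tests/clickhouse-test` code: the alarm handler only raises, and the caller converts the exception into an ordinary failed result instead of signalling the process group. `SketchResult`, `backstop_seconds` and `run_with_watchdog` are hypothetical names introduced for illustration; only the `int(timeout * 1.1) + 60` formula and the `FAIL` / `Timeout!` outcome come from the description above. It assumes a Unix platform and the main thread, since `SIGALRM` is not available elsewhere.

```python
import signal
import time
from dataclasses import dataclass


@dataclass
class SketchResult:
    """Hypothetical stand-in for the runner's per-test result object."""
    name: str
    status: str
    reason: str = ""


def backstop_seconds(timeout: float) -> int:
    # Watchdog delay as described above: 10% slack plus a fixed minute.
    return int(timeout * 1.1) + 60


def _on_alarm(signum, frame):
    # Only raise: no killpg, no process-group signalling from the handler.
    raise TimeoutError("per-test watchdog fired")


def run_with_watchdog(name, body, watchdog_seconds):
    """Run body() under a SIGALRM watchdog and report a timeout as FAIL."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, watchdog_seconds)
    try:
        body()
        return SketchResult(name, "OK")
    except TimeoutError:
        # The frozen test becomes a normal per-test failure; the run goes on.
        return SketchResult(name, "FAIL", "Timeout!")
    finally:
        # Disarm the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

For example, `run_with_watchdog("frozen", lambda: time.sleep(5), 0.1)` returns a `FAIL` result with reason `Timeout!`, and the calling process keeps running.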

ci/jobs/scripts/functional_tests_results.py:

- STOP_TESTING_EXIT_CODE (the in-band signal that the server actually died
  or the hung-check tripped) keeps the Server died leaf and the demotion of
  partial per-test results.
- An external kill - 143 / 137 / -15 / -9 (job-level timeout, runner
  shutdown, the worker→parent SIGTERM feedback loop) - is now reported as
  ERROR with a clickhouse-test leaf, not Server died. It is the same
  "the runner did not finish" condition as the existing not s.success_finish
  branch, and the per-test results that completed before the kill are kept
  authoritative (no UNKNOWN demotion). As ERROR it also routes the
  bugfix-validation inverter down its "preserve, do not invert" path: an abort
  is reported as inconclusive rather than silently claimed as "bug reproduced"
  (a kill does not prove the bug reproduced).
- Removed the dead order keys SERVER_DIED / Timeout / BROKEN (no leaf
  ever has those statuses) and added the real ERROR status, which was
  missing and silently sorted to the front.
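The split between the two exit-code families could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `ci/jobs/scripts/functional_tests_results.py`: the `Leaf` class and `classify_exit_code` are hypothetical, and the numeric value given to `STOP_TESTING_EXIT_CODE` is a placeholder. The set contents, leaf names and statuses are taken from the description.

```python
import signal
from dataclasses import dataclass
from typing import Optional

STOP_TESTING_EXIT_CODE = 42  # placeholder value, an assumption for this sketch
SERVER_DIED_EXIT_CODES = {STOP_TESTING_EXIT_CODE}
# 143 / 137 are shell-style "128 + signal" codes for SIGTERM / SIGKILL;
# -15 / -9 are the same signals as Python's subprocess reports them.
RUNNER_ABORTED_EXIT_CODES = {143, 137, -15, -9}


@dataclass
class Leaf:
    """Hypothetical synthetic result row added by the parser."""
    name: str
    status: str
    demote_partial_results: bool


def classify_exit_code(exit_code: int) -> Optional[Leaf]:
    """Map a runner exit code to a synthetic leaf, or None if no leaf is needed."""
    if exit_code in SERVER_DIED_EXIT_CODES:
        # In-band signal: the server died or the hung-check tripped.
        return Leaf("Server died", "FAIL", demote_partial_results=True)
    if exit_code in RUNNER_ABORTED_EXIT_CODES:
        # External kill: inconclusive, completed per-test results stay as they are.
        return Leaf("clickhouse-test", "ERROR", demote_partial_results=False)
    return None
```

Keeping the two sets disjoint means a job-level timeout can no longer be mistaken for a server crash purely on the basis of the exit code.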

Caveat

Dropping stop_tests from the SIGALRM handler means a single frozen test no
longer aborts the whole run via killpg - it is a FAIL and the run
continues, bounded by --max-failures-chain. This is strictly more resilient
than before (previously one frozen test killed the entire run with 143). The
outer Shell.run safety-net timeout still backstops a genuinely stuck run.
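To illustrate how a chain limit bounds the run once a frozen test is only a FAIL, here is a small sketch. It assumes that `--max-failures-chain` counts consecutive failures and resets on a pass; the function name and the exact `>=` comparison are assumptions for this example, not the runner's implementation.

```python
def run_until_chain_limit(outcomes, max_failures_chain):
    """Return the names of tests executed before the failure chain limit trips.

    outcomes is an iterable of (test_name, passed) pairs.
    """
    chain = 0
    executed = []
    for name, passed in outcomes:
        executed.append(name)
        # A pass resets the chain; a failure (including a timeout) extends it.
        chain = 0 if passed else chain + 1
        if chain >= max_failures_chain:
            break
    return executed
```

Under these assumptions a single frozen test between passing tests never stops the run, while a long streak of timeouts still ends it early.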

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Follow-up to ClickHouse#105643, which folded every mid-flight abort exit code into a single `ABORTED_RUN_EXIT_CODES` set and attached a synthetic `Server died` leaf for all of them. That conflated two unrelated conditions and produced hundreds of spurious `Server died` rows from ordinary job-level timeouts and flaky checks (reported on ClickHouse#105019 and ClickHouse#104965), where the server was healthy and shut down cleanly.

The two conditions are now split:

- `SERVER_DIED_EXIT_CODES` = `{STOP_TESTING_EXIT_CODE}`. This is the in-band signal that the server actually died or the hung-check tripped: `clickhouse-test` raised `StopTesting`, reached its outer handler and exited deliberately. Behaviour is unchanged - the partial per-test results are demoted (the crash may have caused them) and a `Server died` leaf is added.
- `RUNNER_ABORTED_EXIT_CODES` = `{143, 137, -15, -9}`. These mean `clickhouse-test` was terminated by a signal, not that the server died. In particular 143 is `clickhouse-test`'s own exit code: its `signal_handler` turns SIGTERM into a `Terminated` exception and `__main__` exits `128 + signal`. The SIGTERM comes from a job-level wall-clock timeout, a runner shutdown, or the worker -> parent feedback loop in `stop_tests`. These now get a `clickhouse-test` leaf (the same name the end-of-run path already uses for other non-zero exits) reading "clickhouse-test was terminated before finishing", and the per-test results that completed before the kill are kept authoritative instead of being demoted - a real failure that happened before the abort stays visible.

The bugfix-validation inverter is unaffected: it flips any FAIL leaf to OK, so an abort while exercising the new regression test still reads as "bug reproduced", preserving the intent of ClickHouse#105643.

Adds `ci/tests/test_functional_tests_results.py` covering both branches and extends the inverter test for the new leaf.

Reports:
- https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105019&sha=latest&name_0=PR&name_1=Stateless+tests+%28arm_asan_ubsan%2C+targeted%29
- https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104965&sha=b8200879b832ff77f5a82d8d8e5c77efe10d0353&name_0=PR&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20flaky%20check%29

Related: ClickHouse#105643
Related: ClickHouse#106228

…r died

Refines the previous commit after working through where each "timeout" actually originates.

`clickhouse-test` (`tests/clickhouse-test`):

The per-test SIGALRM backstop (the watchdog for completely frozen tests, armed at `int(args.timeout * 1.1) + 60`) used to call `stop_tests` from inside the signal handler. That `killpg` SIGTERMs the runner's own process group, so the run exits 143 - which the job side then misreports as an aborted run / "Server died". Instead, the timed-out test is now recorded as a normal per-test `TestResult(FAIL, FailureReason.TIMEOUT)` through the usual `process_result` path, exactly like the existing query-level and `socket.timeout` paths already do. It surfaces as a plain `[ FAIL ]` line with `Reason: Timeout!`, the run continues (bounded by `--max-failures-chain`), and no exit-code-based reclassification is needed on the job side. The test's client runs in its own session (`start_new_session=True`), so it is still reaped by `--cleanup`, not by the removed `killpg`.

Parser (`ci/jobs/scripts/functional_tests_results.py`):

- A `RUNNER_ABORTED_EXIT_CODES` run (143/137/-15/-9 - external kill, runner shutdown, the worker->parent SIGTERM feedback loop) is now reported as `ERROR` with a `clickhouse-test` leaf, not `FAIL`. It is the same "the runner did not finish" condition as the existing `not s.success_finish` branch, and it is inconclusive rather than a test producing a wrong answer. As `ERROR` it also routes the bugfix-validation inverter down its "preserve, do not invert" path: an abort is reported as inconclusive instead of being silently claimed as "bug reproduced" (a kill does not prove the bug reproduced).
- Removed the dead `order` keys `SERVER_DIED` / `Timeout` / `BROKEN` (no leaf ever has those statuses - the synthetic "Server died" leaf is status `FAIL`) and added the real `ERROR` status, which was missing and silently sorted to the front.

Tests updated accordingly; added coverage for the abort -> ERROR behaviour and the inverter preserving it.

Related: ClickHouse#105643
Related: ClickHouse#106228
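The "preserve, do not invert" behaviour described in this commit could look roughly like the sketch below. The function name is hypothetical, and the `OK` to `FAIL` direction is an assumption about how bugfix validation treats a test that unexpectedly passes; only the `FAIL` to `OK` flip and the preservation of `ERROR` are stated in the commit messages.

```python
def invert_for_bugfix_validation(status: str) -> str:
    """Invert a leaf status for a bugfix-validation run (illustrative only)."""
    if status == "FAIL":
        # The regression test failed on the old binary: the bug reproduced.
        return "OK"
    if status == "OK":
        # Assumption: a pass on the old binary means the bug did not reproduce.
        return "FAIL"
    # ERROR (and anything else) is inconclusive and is passed through unchanged.
    return status
```

With the abort leaf reported as `ERROR` rather than `FAIL`, a killed runner passes through this function unchanged instead of being turned into a success.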

clickhouse-gh · 2026-06-04T22:33:11Z

Workflow [PR], commit [ec8e646]

Summary: ✅

AI Review

Summary

This PR changes functional-test result classification so STOP_TESTING_EXIT_CODE remains a real Server died, external runner kills become an inconclusive ERROR, and a per-test SIGALRM timeout should become a normal timeout failure. I found one correctness issue in the new per-test timeout path, so this should be fixed before merge.

Findings

⚠️ Majors

[tests/clickhouse-test:4457] The new except TimeoutError branch does not catch the actual per-test alarm in the usual path. timeout_handler raises Python's built-in TimeoutError while execution is inside test_case.run, but TestCase.run catches it with the broad except Exception at tests/clickhouse-test:3636 and returns TestStatus.UNKNOWN / FailureReason.INTERNAL_ERROR. A frozen test is therefore reported as [ UNKNOWN ] with Reason: Test internal error, not the promised normal [ FAIL ] with Reason: Timeout!. Move timeout handling before the broad exception path, or construct the timeout TestResult inside TestCase.run while still killing/reaping the still-running client process group.

Tests

⚠️ Add a focused regression test that triggers the per-test SIGALRM timeout path and verifies the emitted result is [ FAIL ] with Reason: Timeout!, not [ UNKNOWN ].

Missing context / blind spots

⚠️ The current Praktika S3 report for this PR has no test results yet (Total: 0), so I could not validate this against a completed CI run. A completed CI run with the new tests would close this gap.

Final Verdict

Status: ⚠️ Request changes

Minimum required actions: fix the per-test alarm handling so it reports the intended timeout failure without leaking the client process group, and cover that path with a focused regression test.

clickhouse-gh · 2026-06-04T22:37:49Z

@@ -4449,6 +4455,21 @@ def run_tests_array(
                    test_result = test_case.run(args, test_suite, client_options)
                    test_result = test_case.process_result(test_result, MESSAGES)
                except TimeoutError:


This branch does not catch the actual per-test alarm in the usual path. The signal handler raises Python's built-in TimeoutError while execution is inside test_case.run, but TestCase.run catches it with the broad except Exception at tests/clickhouse-test:3636 and returns TestStatus.UNKNOWN/FailureReason.INTERNAL_ERROR; control never reaches this except TimeoutError. As a result a frozen test is reported as [ UNKNOWN ] with Reason: Test internal error, not as the promised normal [ FAIL ] with Reason: Timeout!.

Please handle the alarm before the generic exception path, or move the timeout result construction into TestCase.run while still killing/reaping the still-running client process group. A focused test that triggers this alarm path would prevent this from regressing.
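The control-flow problem can be reproduced in isolation. The sketch below uses hypothetical function names and plain tuples rather than the real `TestCase.run` or `TestResult`; it only relies on the fact that Python's built-in `TimeoutError` is a subclass of `Exception`, so a broad handler in the inner function intercepts it before the caller's `except TimeoutError` can run.

```python
def run_with_broad_handler(body):
    """Inner runner whose catch-all swallows the alarm's TimeoutError."""
    try:
        body()
        return ("OK", "")
    except Exception:
        return ("UNKNOWN", "Test internal error")


def run_with_timeout_first(body):
    """Inner runner that handles the timeout before the catch-all."""
    try:
        body()
        return ("OK", "")
    except TimeoutError:
        return ("FAIL", "Timeout!")
    except Exception:
        return ("UNKNOWN", "Test internal error")


def outer_loop(run, body):
    """Caller with its own TimeoutError branch, as in the diff above."""
    try:
        return run(body)
    except TimeoutError:
        # Never reached when the inner runner has already caught the exception.
        return ("FAIL", "Timeout!")


def frozen_test():
    # Stands in for the alarm handler raising while the test is running.
    raise TimeoutError("per-test watchdog fired")
```

Calling `outer_loop(run_with_broad_handler, frozen_test)` yields the `UNKNOWN` result described in the finding, while `outer_loop(run_with_timeout_first, frozen_test)` yields the intended `FAIL` with `Timeout!`.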

leshikus added 2 commits June 4, 2026 23:11

leshikus added the "can be tested" label (Allows running workflows for external contributors) Jun 4, 2026

clickhouse-gh Bot added the pr-ci label Jun 4, 2026

clickhouse-gh Bot reviewed Jun 4, 2026

leshikus wants to merge 2 commits into ClickHouse:master from leshikus:ci/distinguish-runner-abort-from-server-died
