Do not report test-runner abortions and per-test timeouts as Server died#106506
Do not report test-runner abortions and per-test timeouts as Server died#106506leshikus wants to merge 2 commits into
Conversation
Follow-up to ClickHouse#105643 (ClickHouse#105643), which folded every mid-flight abort exit code into a single `ABORTED_RUN_EXIT_CODES` set and attached a synthetic `Server died` leaf for all of them. That conflated two unrelated conditions and produced hundreds of spurious `Server died` rows from ordinary job-level timeouts and flaky checks (reported on ClickHouse#105019 and ClickHouse#104965), where the server was healthy and shut down cleanly. The two conditions are now split: - `SERVER_DIED_EXIT_CODES` = `{STOP_TESTING_EXIT_CODE}`. This is the in-band signal that the server actually died or the hung-check tripped: `clickhouse-test` raised `StopTesting`, reached its outer handler and exited deliberately. Behaviour is unchanged - the partial per-test results are demoted (the crash may have caused them) and a `Server died` leaf is added. - `RUNNER_ABORTED_EXIT_CODES` = `{143, 137, -15, -9}`. These mean `clickhouse-test` was terminated by a signal, not that the server died. In particular 143 is `clickhouse-test`'s own exit code: its `signal_handler` turns SIGTERM into a `Terminated` exception and `__main__` exits `128 + signal`. The SIGTERM comes from a job-level wall-clock timeout, a runner shutdown, or the worker -> parent feedback loop in `stop_tests`. These now get a `clickhouse-test` leaf (the same name the end-of-run path already uses for other non-zero exits) reading "clickhouse-test was terminated before finishing", and the per-test results that completed before the kill are kept authoritative instead of being demoted - a real failure that happened before the abort stays visible. The bugfix-validation inverter is unaffected: it flips any FAIL leaf to OK, so an abort while exercising the new regression test still reads as "bug reproduced", preserving the intent of ClickHouse#105643. Adds `ci/tests/test_functional_tests_results.py` covering both branches and extends the inverter test for the new leaf. Reports: - https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105019&sha=latest&name_0=PR&name_1=Stateless+tests+%28arm_asan_ubsan%2C+targeted%29 - https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104965&sha=b8200879b832ff77f5a82d8d8e5c77efe10d0353&name_0=PR&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20flaky%20check%29 Related: ClickHouse#105643 Related: ClickHouse#106228
…r died Refines the previous commit after working through where each "timeout" actually originates. `clickhouse-test` (`tests/clickhouse-test`): The per-test SIGALRM backstop (the watchdog for completely frozen tests, armed at `int(args.timeout * 1.1) + 60`) used to call `stop_tests` from inside the signal handler. That `killpg` SIGTERMs the runner's own process group, so the run exits 143 - which the job side then misreports as an aborted run / "Server died". Instead, the timed-out test is now recorded as a normal per-test `TestResult(FAIL, FailureReason.TIMEOUT)` through the usual `process_result` path, exactly like the existing query-level and `socket.timeout` paths already do. It surfaces as a plain `[ FAIL ]` line with `Reason: Timeout!`, the run continues (bounded by `--max-failures-chain`), and no exit-code-based reclassification is needed on the job side. The test's client runs in its own session (`start_new_session=True`), so it is still reaped by `--cleanup`, not by the removed `killpg`. Parser (`ci/jobs/scripts/functional_tests_results.py`): - A `RUNNER_ABORTED_EXIT_CODES` run (143/137/-15/-9 - external kill, runner shutdown, the worker->parent SIGTERM feedback loop) is now reported as `ERROR` with a `clickhouse-test` leaf, not `FAIL`. It is the same "the runner did not finish" condition as the existing `not s.success_finish` branch, and it is inconclusive rather than a test producing a wrong answer. As `ERROR` it also routes the bugfix-validation inverter down its "preserve, do not invert" path: an abort is reported as inconclusive instead of being silently claimed as "bug reproduced" (a kill does not prove the bug reproduced). - Removed the dead `order` keys `SERVER_DIED` / `Timeout` / `BROKEN` (no leaf ever has those statuses - the synthetic "Server died" leaf is status `FAIL`) and added the real `ERROR` status, which was missing and silently sorted to the front. Tests updated accordingly; added coverage for the abort -> ERROR behaviour and the inverter preserving it. Related: ClickHouse#105643 Related: ClickHouse#106228
|
Workflow [PR], commit [ec8e646] Summary: ✅ AI ReviewSummaryThis PR changes functional-test result classification so Findings
Tests
Missing context / blind spots
Final VerdictStatus: Minimum required actions: fix the per-test alarm handling so it reports the intended timeout failure without leaking the client process group, and cover that path with a focused regression test. |
| @@ -4449,6 +4455,21 @@ def run_tests_array( | |||
| test_result = test_case.run(args, test_suite, client_options) | |||
| test_result = test_case.process_result(test_result, MESSAGES) | |||
| except TimeoutError: | |||
There was a problem hiding this comment.
This branch does not catch the actual per-test alarm in the usual path. The signal handler raises Python's built-in TimeoutError while execution is inside test_case.run, but TestCase.run catches it with the broad except Exception at tests/clickhouse-test:3636 and returns TestStatus.UNKNOWN/FailureReason.INTERNAL_ERROR; control never reaches this except TimeoutError. As a result a frozen test is reported as [ UNKNOWN ] with Reason: Test internal error, not as the promised normal [ FAIL ] with Reason: Timeout!.
Please handle the alarm before the generic exception path, or move the timeout result construction into TestCase.run while still killing/reaping the still-running client process group. A focused test that triggers this alarm path would prevent this from regressing.
Motivation
Follow-up to #105643, which folded every mid-flight abort exit code into a
single
ABORTED_RUN_EXIT_CODESset and attached a syntheticServer diedleaf to all of them. That conflated unrelated conditions and produced
hundreds of spurious
Server diedrows from ordinary job-level timeouts andflaky/targeted checks, where the server was healthy and shut down cleanly
(reported by @Algunenano; see e.g. the targeted run on #105019 and the flaky
check on #104965, and the alternative #106228).
Working through where each "timeout" actually originates showed there is no
single signal to fix - the cases are genuinely different and should be
reported differently.
Changes
tests/clickhouse-test— the per-test SIGALRM backstop (the watchdog forcompletely frozen tests, armed at
int(args.timeout * 1.1) + 60) used tocall
stop_testsfrom inside the signal handler. ThatkillpgSIGTERMs therunner's own process group, so the run exits
143, which the job side thenmisreports as an aborted run /
Server died. The timed-out test is nowrecorded as a normal per-test
TestResult(FAIL, FailureReason.TIMEOUT)through the usual
process_resultpath - exactly like the existingquery-level and
socket.timeoutpaths already do - so it surfaces as a plain[ FAIL ]line withReason: Timeout!. No exit-code-based reclassificationis needed on the job side. The test's client runs in its own session
(
start_new_session=True), so it is still reaped by--cleanup, not by theremoved
killpg.ci/jobs/scripts/functional_tests_results.py:STOP_TESTING_EXIT_CODE(the in-band signal that the server actually diedor the hung-check tripped) keeps the
Server diedleaf and the demotion ofpartial per-test results.
143/137/-15/-9(job-level timeout, runnershutdown, the worker→parent SIGTERM feedback loop) - is now reported as
ERRORwith aclickhouse-testleaf, notServer died. It is the same"the runner did not finish" condition as the existing
not s.success_finishbranch, and the per-test results that completed before the kill are kept
authoritative (no
UNKNOWNdemotion). AsERRORit also routes thebugfix-validation inverter down its "preserve, do not invert" path: an abort
is reported as inconclusive rather than silently claimed as "bug reproduced"
(a kill does not prove the bug reproduced).
orderkeysSERVER_DIED/Timeout/BROKEN(no leafever has those statuses) and added the real
ERRORstatus, which wasmissing and silently sorted to the front.
Caveat
Dropping
stop_testsfrom the SIGALRM handler means a single frozen test nolonger aborts the whole run via
killpg- it is aFAILand the runcontinues, bounded by
--max-failures-chain. This is strictly more resilientthan before (previously one frozen test killed the entire run with
143). Theouter
Shell.runsafety-net timeout still backstops a genuinely stuck run.Changelog category (leave one):