
Do not report test-runner abortions and per-test timeouts as Server died #106506

Draft
leshikus wants to merge 2 commits into ClickHouse:master from leshikus:ci/distinguish-runner-abort-from-server-died

Conversation

leshikus (Contributor) commented Jun 4, 2026

Motivation

Follow-up to #105643, which folded every mid-flight abort exit code into a
single ABORTED_RUN_EXIT_CODES set and attached a synthetic Server died
leaf to all of them. That conflated unrelated conditions and produced
hundreds of spurious Server died rows from ordinary job-level timeouts and
flaky/targeted checks, where the server was healthy and shut down cleanly
(reported by @Algunenano; see e.g. the targeted run on #105019 and the flaky
check on #104965, and the alternative #106228).

Working through where each "timeout" actually originates showed there is no
single signal to fix - the cases are genuinely different and should be
reported differently.

Changes

tests/clickhouse-test: the per-test SIGALRM backstop (the watchdog for
completely frozen tests, armed at int(args.timeout * 1.1) + 60) used to
call stop_tests from inside the signal handler. stop_tests sends SIGTERM to
the runner's own process group via killpg, so the run exits with 143, which
the job side then misreports as an aborted run / Server died. The timed-out
test is now recorded as a normal per-test
TestResult(FAIL, FailureReason.TIMEOUT) through the usual process_result
path, exactly as the existing query-level and socket.timeout paths do, so it
surfaces as a plain [ FAIL ] line with Reason: Timeout!. No
exit-code-based reclassification is needed on the job side. The test's
client runs in its own session (start_new_session=True), so it is still
reaped by --cleanup, not by the removed killpg.
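
A minimal sketch of the idea above, not the actual tests/clickhouse-test code: the alarm handler only raises, and the caller turns that exception into an ordinary per-test result instead of signalling the process group. The names run_with_watchdog and the TestResult dataclass, as well as the string statuses, are hypothetical stand-ins. signal.setitimer is used here instead of signal.alarm only so the example can use a sub-second timeout. POSIX only, and it must run in the main thread.

```python
import signal
import time
from dataclasses import dataclass


@dataclass
class TestResult:
    # Hypothetical stand-in for the runner's per-test result record.
    name: str
    status: str
    reason: str = ""


def timeout_handler(signum, frame):
    # Only raise; do not kill the process group from inside the handler.
    raise TimeoutError("per-test watchdog fired")


def run_with_watchdog(name, test_body, timeout_seconds):
    """Run test_body under a SIGALRM watchdog and always return a TestResult."""
    previous = signal.signal(signal.SIGALRM, timeout_handler)
    signal.setitimer(signal.ITIMER_REAL, timeout_seconds)
    try:
        test_body()
        return TestResult(name, "OK")
    except TimeoutError:
        # A frozen test becomes a normal failure, so the run can continue.
        return TestResult(name, "FAIL", "Timeout!")
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the watchdog
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


if __name__ == "__main__":
    print(run_with_watchdog("fast", lambda: None, 0.5))
    print(run_with_watchdog("frozen", lambda: time.sleep(5), 0.2))
```

The sketch only works because nothing between the handler and the except TimeoutError clause catches the exception first; a broad except Exception inside test_body would intercept it.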

ci/jobs/scripts/functional_tests_results.py:

  • STOP_TESTING_EXIT_CODE (the in-band signal that the server actually died
    or the hung-check tripped) keeps the Server died leaf and the demotion of
    partial per-test results.
  • An external kill - 143 / 137 / -15 / -9 (job-level timeout, runner
    shutdown, the worker→parent SIGTERM feedback loop) - is now reported as
    ERROR with a clickhouse-test leaf, not Server died. It is the same
    "the runner did not finish" condition as the existing not s.success_finish
    branch, and the per-test results that completed before the kill are kept
    authoritative (no UNKNOWN demotion). As ERROR it also routes the
    bugfix-validation inverter down its "preserve, do not invert" path: an abort
    is reported as inconclusive rather than silently claimed as "bug reproduced"
    (a kill does not prove the bug reproduced).
  • Removed the dead order keys SERVER_DIED / Timeout / BROKEN (no leaf
    ever has those statuses) and added the real ERROR status, which was
    missing and silently sorted to the front.
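
The classification in the first two bullets can be sketched roughly as follows. This is an illustration under assumptions, not the code in ci/jobs/scripts/functional_tests_results.py: the numeric value of STOP_TESTING_EXIT_CODE and the helper synthetic_leaf are hypothetical, while the set names and the leaf names and statuses are taken from the commit messages in this PR.

```python
# Hypothetical value; the real constant is defined in the CI scripts.
STOP_TESTING_EXIT_CODE = 42

SERVER_DIED_EXIT_CODES = {STOP_TESTING_EXIT_CODE}
RUNNER_ABORTED_EXIT_CODES = {143, 137, -15, -9}


def synthetic_leaf(exit_code):
    """Return (leaf_name, status, demote_partial_results), or None.

    None means "no synthetic leaf from this sketch"; other non-zero exits
    are handled by the end-of-run path, which is not modelled here.
    """
    if exit_code in SERVER_DIED_EXIT_CODES:
        # The server really died or the hung-check tripped: partial
        # per-test results may be a consequence, so they are demoted.
        return ("Server died", "FAIL", True)
    if exit_code in RUNNER_ABORTED_EXIT_CODES:
        # The runner was killed from outside: inconclusive, and results
        # that completed before the kill stay authoritative.
        return ("clickhouse-test", "ERROR", False)
    return None


if __name__ == "__main__":
    for code in (STOP_TESTING_EXIT_CODE, 143, -9, 0):
        print(code, synthetic_leaf(code))
```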

Caveat

Dropping stop_tests from the SIGALRM handler means a single frozen test no
longer aborts the whole run via killpg - it is a FAIL and the run
continues, bounded by --max-failures-chain. This is strictly more resilient
than before (previously one frozen test killed the entire run with 143). The
outer Shell.run safety-net timeout still backstops a genuinely stuck run.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

leshikus added 2 commits June 4, 2026 23:11
Follow-up to ClickHouse#105643,
which folded every mid-flight abort exit code into a single
`ABORTED_RUN_EXIT_CODES` set and attached a synthetic `Server died` leaf
for all of them. That conflated two unrelated conditions and produced
hundreds of spurious `Server died` rows from ordinary job-level timeouts
and flaky checks (reported on ClickHouse#105019 and ClickHouse#104965), where the server was
healthy and shut down cleanly.

The two conditions are now split:

- `SERVER_DIED_EXIT_CODES` = `{STOP_TESTING_EXIT_CODE}`. This is the
  in-band signal that the server actually died or the hung-check tripped:
  `clickhouse-test` raised `StopTesting`, reached its outer handler and
  exited deliberately. Behaviour is unchanged - the partial per-test
  results are demoted (the crash may have caused them) and a `Server died`
  leaf is added.

- `RUNNER_ABORTED_EXIT_CODES` = `{143, 137, -15, -9}`. These mean
  `clickhouse-test` was terminated by a signal, not that the server died.
  In particular 143 is `clickhouse-test`'s own exit code: its
  `signal_handler` turns SIGTERM into a `Terminated` exception and
  `__main__` exits `128 + signal`. The SIGTERM comes from a job-level
  wall-clock timeout, a runner shutdown, or the worker -> parent feedback
  loop in `stop_tests`. These now get a `clickhouse-test` leaf (the same
  name the end-of-run path already uses for other non-zero exits) reading
  "clickhouse-test was terminated before finishing", and the per-test
  results that completed before the kill are kept authoritative instead of
  being demoted - a real failure that happened before the abort stays
  visible.

The bugfix-validation inverter is unaffected: it flips any FAIL leaf to OK,
so an abort while exercising the new regression test still reads as "bug
reproduced", preserving the intent of ClickHouse#105643.

Adds `ci/tests/test_functional_tests_results.py` covering both branches and
extends the inverter test for the new leaf.

Reports:
- https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105019&sha=latest&name_0=PR&name_1=Stateless+tests+%28arm_asan_ubsan%2C+targeted%29
- https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104965&sha=b8200879b832ff77f5a82d8d8e5c77efe10d0353&name_0=PR&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20flaky%20check%29

Related: ClickHouse#105643
Related: ClickHouse#106228
…r died

Refines the previous commit after working through where each "timeout"
actually originates.

`clickhouse-test` (`tests/clickhouse-test`):
The per-test SIGALRM backstop (the watchdog for completely frozen tests,
armed at `int(args.timeout * 1.1) + 60`) used to call `stop_tests` from
inside the signal handler. `stop_tests` sends SIGTERM to the runner's own
process group via `killpg`, so the run exits 143, which the job side then
misreports as an aborted run / "Server died". Instead, the timed-out test is now recorded
as a normal per-test `TestResult(FAIL, FailureReason.TIMEOUT)` through the
usual `process_result` path, exactly like the existing query-level and
`socket.timeout` paths already do. It surfaces as a plain `[ FAIL ]` line
with `Reason: Timeout!`, the run continues (bounded by
`--max-failures-chain`), and no exit-code-based reclassification is needed
on the job side. The test's client runs in its own session
(`start_new_session=True`), so it is still reaped by `--cleanup`, not by
the removed `killpg`.

Parser (`ci/jobs/scripts/functional_tests_results.py`):
- A `RUNNER_ABORTED_EXIT_CODES` run (143/137/-15/-9 - external kill, runner
  shutdown, the worker->parent SIGTERM feedback loop) is now reported as
  `ERROR` with a `clickhouse-test` leaf, not `FAIL`. It is the same "the
  runner did not finish" condition as the existing `not s.success_finish`
  branch, and it is inconclusive rather than a test producing a wrong
  answer. As `ERROR` it also routes the bugfix-validation inverter down its
  "preserve, do not invert" path: an abort is reported as inconclusive
  instead of being silently claimed as "bug reproduced" (a kill does not
  prove the bug reproduced).
- Removed the dead `order` keys `SERVER_DIED` / `Timeout` / `BROKEN` (no
  leaf ever has those statuses - the synthetic "Server died" leaf is
  status `FAIL`) and added the real `ERROR` status, which was missing and
  silently sorted to the front.
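
One way a missing status can "silently sort to the front" is a rank lookup with a low default. The sketch below assumes that mechanism; the tables, the default of -1, the rank given to `ERROR`, and the helper `sort_leaves` are all hypothetical and are not taken from the parser.

```python
def sort_leaves(leaves, order):
    # Assumption: a status missing from the table gets rank -1, so it sorts first.
    return sorted(leaves, key=lambda leaf: order.get(leaf[1], -1))


# Hypothetical tables: lower rank sorts earlier in the report.
ORDER_BEFORE = {"FAIL": 0, "UNKNOWN": 1, "OK": 2}  # no ERROR entry
ORDER_AFTER = {"FAIL": 0, "ERROR": 1, "UNKNOWN": 2, "OK": 3}  # ERROR ranked explicitly

LEAVES = [("test_a", "OK"), ("clickhouse-test", "ERROR"), ("test_b", "FAIL")]

if __name__ == "__main__":
    print([name for name, _ in sort_leaves(LEAVES, ORDER_BEFORE)])
    print([name for name, _ in sort_leaves(LEAVES, ORDER_AFTER)])
```

With the first table the `ERROR` leaf lands ahead of the real failure only because it is unknown to the table; with the second its position is an explicit choice.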

Tests updated accordingly; added coverage for the abort -> ERROR behaviour
and the inverter preserving it.
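
A rough sketch of the inverter behaviour these commit messages describe, limited to what they state: `FAIL` leaves flip to `OK`, and an `ERROR` leaf is preserved. The function name and the tuple representation are hypothetical, and how the real inverter treats other statuses is not described here, so the sketch leaves them untouched.

```python
def invert_for_bugfix_validation(leaves):
    """Flip FAIL to OK and keep every other status, including ERROR, as is."""
    inverted = []
    for name, status in leaves:
        if status == "FAIL":
            # The regression test failed without the fix: "bug reproduced".
            inverted.append((name, "OK"))
        else:
            # ERROR (an aborted runner) is inconclusive, so it is not inverted.
            inverted.append((name, status))
    return inverted


if __name__ == "__main__":
    leaves = [("new_regression_test", "FAIL"), ("clickhouse-test", "ERROR")]
    print(invert_for_bugfix_validation(leaves))
```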

Related: ClickHouse#105643
Related: ClickHouse#106228
leshikus added the "can be tested" label (Allows running workflows for external contributors) Jun 4, 2026
clickhouse-gh (Bot, Contributor) commented Jun 4, 2026

Workflow [PR], commit [ec8e646]


AI Review

Summary

This PR changes functional-test result classification so STOP_TESTING_EXIT_CODE remains a real Server died, external runner kills become an inconclusive ERROR, and a per-test SIGALRM timeout should become a normal timeout failure. I found one correctness issue in the new per-test timeout path, so this should be fixed before merge.

Findings

⚠️ Majors

  • [tests/clickhouse-test:4457] The new except TimeoutError branch does not catch the actual per-test alarm in the usual path. timeout_handler raises Python's built-in TimeoutError while execution is inside test_case.run, but TestCase.run catches it with the broad except Exception at tests/clickhouse-test:3636 and returns TestStatus.UNKNOWN / FailureReason.INTERNAL_ERROR. A frozen test is therefore reported as [ UNKNOWN ] with Reason: Test internal error, not the promised normal [ FAIL ] with Reason: Timeout!. Move timeout handling before the broad exception path, or construct the timeout TestResult inside TestCase.run while still killing/reaping the still-running client process group.

Tests

⚠️ Add a focused regression test that triggers the per-test SIGALRM timeout path and verifies the emitted result is [ FAIL ] with Reason: Timeout!, not [ UNKNOWN ].

Missing context / blind spots

⚠️ The current Praktika S3 report for this PR has no test results yet (Total: 0), so I could not validate this against a completed CI run. A completed CI run with the new tests would close this gap.

Final Verdict

Status: ⚠️ Request changes

Minimum required actions: fix the per-test alarm handling so it reports the intended timeout failure without leaking the client process group, and cover that path with a focused regression test.

clickhouse-gh (Bot) added the pr-ci label Jun 4, 2026
Comment thread tests/clickhouse-test
@@ -4449,6 +4455,21 @@ def run_tests_array(
        test_result = test_case.run(args, test_suite, client_options)
        test_result = test_case.process_result(test_result, MESSAGES)
    except TimeoutError:

This branch does not catch the actual per-test alarm in the usual path. The signal handler raises Python's built-in TimeoutError while execution is inside test_case.run, but TestCase.run catches it with the broad except Exception at tests/clickhouse-test:3636 and returns TestStatus.UNKNOWN/FailureReason.INTERNAL_ERROR; control never reaches this except TimeoutError. As a result a frozen test is reported as [ UNKNOWN ] with Reason: Test internal error, not as the promised normal [ FAIL ] with Reason: Timeout!.

Please handle the alarm before the generic exception path, or move the timeout result construction into TestCase.run while still killing/reaping the still-running client process group. A focused test that triggers this alarm path would prevent this from regressing.
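
The ordering problem can be reproduced in isolation. The sketch below is not the code from tests/clickhouse-test; run_swallowing, run_timeout_first and frozen_test are hypothetical names, and the tuples stand in for TestResult. It relies only on the fact that Python's built-in TimeoutError is a subclass of Exception, so a broad handler that sits closer to the raise wins.

```python
def run_swallowing(test_body):
    # Mirrors the reported problem: the broad handler sees TimeoutError first.
    try:
        test_body()
        return ("OK", "")
    except Exception:
        return ("UNKNOWN", "Test internal error")


def run_timeout_first(test_body):
    # One possible fix: handle TimeoutError before the generic handler.
    try:
        test_body()
        return ("OK", "")
    except TimeoutError:
        return ("FAIL", "Timeout!")
    except Exception:
        return ("UNKNOWN", "Test internal error")


def frozen_test():
    # Stands in for the alarm handler raising while the test body runs.
    raise TimeoutError("per-test watchdog fired")


if __name__ == "__main__":
    print(issubclass(TimeoutError, Exception))
    print(run_swallowing(frozen_test))
    print(run_timeout_first(frozen_test))
```

A regression test along these lines, driven by a real alarm rather than a direct raise, would cover the path requested above.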
