MOD-15862 Diagnostic: bound SSL CI runs with per-test + step timeouts#93
MOD-15862 Diagnostic: bound SSL CI runs with per-test + step timeouts#93gabsow wants to merge 5 commits into
Conversation
… hang Linux CI runs were getting cancelled at the 6h job cap because some test under the new `--tls` (single-shard) invocation of `make test_ssl` blocks indefinitely. RLTest was being invoked with `--no-progress` and the default `--test-timeout 0`, so neither the test name nor a hang signal made it into the log. - Drop `--no-progress` from run_tests.sh and add `--test-timeout 120`, so any single test that hangs gets killed at 2 min and shows up by name. - Add `timeout-minutes: 45` to the Linux job and `timeout-minutes: 25` to the SSL step so a wedged run fails fast instead of burning 6h. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt set --test-timeout via run_tests.sh, which is shared between Linux and macOS workflows, so macOS Default tests began hitting the cap and failing even though they were passing in the past. Move the diagnostics into env vars consumed by run_tests.sh only when the Linux SSL step sets them: - run_tests.sh: restore the original --no-progress, but interpolate an optional RLTEST_EXTRA_ARGS env var (empty by default — no behaviour change for any consumer that does not opt in). - linux.yml SSL step: PYTHONUNBUFFERED=1 so test names flush in real time (CI stdout has no TTY, so otherwise prints were batching at end and the hung test name never made it out), plus RLTEST_EXTRA_ARGS="--test-timeout 180" so an individual hung test fails fast and is named in the log. macOS workflow is untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the Linux SSL hang: testSendRetriesMechanizm hardcoded
three `-Err` exchanges, but under TLS only TWO INNERCOMMUNICATION
sends actually go out before libmr gives up.
Why the count differs by TLS:
- Non-TLS: NETWORKTEST runs after the node reaches NodeStatus_Connected,
so MR_ClusterSendMsgToNode sends INNERCOMMUNICATION synchronously
(retries stays 0). The first `-Err` triggers a disconnect, after which
MR_HelloResponseArrived re-sends from pendingMessages, incrementing
retries to 1, 2, then 3. At retries==MSG_MAX_RETRIES (=3) libmr logs
`Gave up of message...`. Total: 1 initial + 2 resends = 3 sends.
- TLS: the TLS+AUTH+HELLO handshake takes longer; NETWORKTEST runs
while the node status is still NodeStatus_HelloSent. The "message
was not sent because status is not connected" path queues the msg
in pendingMessages and the actual first send happens via the
resend loop in MR_HelloResponseArrived -- which DOES increment
retries. So the initial send burns retry #1, leaving only one
further resend before give-up. Total: 2 sends.
The old test waited unboundedly for a fourth GetConnection on Linux
TLS and hung until the 6h job cap. The single-shard `--tls` invocation
that 2857cfe added is what surfaced this, since this test has
skipOnCluster=True and was previously only run in cluster mode under
TLS (where it was skipped entirely).
Rewrite the test to express the actual invariant -- libmr sends the
message between 1 and MSG_MAX_RETRIES times, then stops -- with
bounded read/connection waits so it cannot hang regardless of where
the retry-count boundary lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause found and fixed in latest commit (175346e)The hung test is Why the count differsPer src/cluster.c:445-451,
The redis log embedded in the timeout dump confirmed this: only two Why this surfaced nowCommit 2857cfe (MOD-14850) added a single-shard FixRewrite the test to express the actual invariant — libmr sends the message between 1 and |
testSendRetriesMechanizm hang is now fixed, but the SSL run on 7.2 with --env oss-cluster --shards-count 3 surfaced a separate pre-existing flake: testUnevenWork can take >180s on that runner combination (it has an internal TimeLimit(2) but the SIGALRM evidently isn't interrupting the redis-py TLS handshake reliably in this configuration). On 7.4 and unstable the same invocation completes in ~144s, so 180s was just too tight. Bump to 300s. This will be tracked separately -- the actual fix likely belongs in test_basic.py / redis-py connection setup rather than the workflow timeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR's CI run hit a 6h cancel on macOS 7.4 SSL — a runner-side flake (other macOS jobs in the same run passed in under 8 minutes). The macOS workflow had no timeout-minutes set, so a single hung test could burn the full job cap. Mirror the Linux workflow: 45min job cap + 25min SSL step cap. macOS keeps the default --test-timeout (no per-test cap) since it isn't where the original MOD-15862 hang occurred and we don't want to risk regressing the runs that currently pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 09446a2. Configure here.
|
|
||
|
|
||
| "${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command | ||
| "${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress ${RLTEST_EXTRA_ARGS:-} --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command |
There was a problem hiding this comment.
--no-progress not removed despite stated PR intent
Medium Severity
The PR description explicitly states "Drop --no-progress from run_tests.sh" as a key diagnostic measure, noting that "the hang is invisible in the log because run_tests.sh passes RLTest --no-progress…so the test name never makes it out before the job is killed." However, --no-progress is still present on the updated line. This defeats the PR's diagnostic goal of making hanging test names visible in CI output before --test-timeout kills them.
Reviewed by Cursor Bugbot for commit 09446a2. Configure here.


Summary
--tlsinvocation tomake test_sslso the newtestNoCrashOnTLSHandshakeFailurecould run. That invocation also un-skipped the entiretest_network.pyShardMock suite under TLS for the first time, and at least one of those tests is hanging on Linux.run_tests.shpasses RLTest--no-progressand the default--test-timeout 0(no timeout), so the test name never makes it out before the job is killed.This PR is the diagnostic step — not the fix:
--no-progressfromrun_tests.shand add--test-timeout 120, so any single test that hangs gets killed at 2 min and shows up by name.timeout-minutes: 45to the Linux job andtimeout-minutes: 25to the SSL step so a wedged run fails fast instead of burning 6h.Once this run completes, the failing test name will be in the log and the next PR can fix the actual hang.
Test plan
--test-timeoutunstable/7.4(no behaviour change expected there)🤖 Generated with Claude Code
Note
Low Risk
Changes are limited to CI configuration and test harness behavior; no production cluster or auth logic is modified.
Overview
Adds CI guardrails so wedged Linux SSL runs fail fast instead of hitting the 6h job cap: 45m job timeout on Linux/macOS, 25m on the SSL test step, and on Linux SSL
PYTHONUNBUFFERED, plusRLTEST_EXTRA_ARGS(e.g.--test-timeout 300) passed throughrun_tests.sh.Relaxes
testSendRetriesMechanizmso it no longer assumes exactly three hard-coded error/reconnect cycles. It now loops with timeouts, allows 1–3INNERCOMMUNICATIONattempts (TLS can queue the first send in the retry path), and still asserts no reconnect afterMSG_MAX_RETRIES.Reviewed by Cursor Bugbot for commit 09446a2. Bugbot is set up for automated code reviews on this repo. Configure here.