
[https://nvbugs/6043291][fix] Add fatal error detection to prevent zombie worker pods#12718

Open
chienchunhung wants to merge 8 commits into NVIDIA:main from chienchunhung:fix-zombie-worker-health-check

Conversation

@chienchunhung
Collaborator

@chienchunhung chienchunhung commented Apr 2, 2026

Summary by CodeRabbit

  • New Features

    • Health checks now detect executor fatal states and report concise failure reasons.
    • Added error severity classification and a token‑bucket error budget to control escalation.
    • Background monitor thread watches worker futures and triggers shutdown on detected failures.
  • Bug Fixes

    • Fatal errors reliably initiate shutdown and cause active requests to fail.
    • Health endpoints surface fatal-error details and can trigger process shutdown when needed.
  • Tests

    • New comprehensive tests for error classification, health checks, monitor behavior, and endpoints.

Description

Summary

  • Adds fatal error detection throughout the executor stack so that unrecoverable errors (CUDA context corruption, OOM, NCCL failures, error budget exhaustion) trigger a clean shutdown instead of leaving worker pods in a zombie state.
  • Introduces a check_health() method on GenerationExecutor that drains the error queue directly (skipping per-request errors) and checks fatal error state, replacing the previous is_shutdown()-only checks.
  • Adds a background error monitor thread in GenerationExecutorProxy to detect MPI worker crashes even when no generate() calls or health checks are in flight.
  • Classifies errors into three tiers (immediate-fatal / severe / transient) with a token-bucket error budget that crashes fast on corrupted CUDA contexts while tolerating brief transient issues.

Problem

When a worker encounters an unrecoverable error (e.g. CUDA OOM or NCCL crash), the existing code only fails the in-flight requests but keeps the server alive. Kubernetes health probes continue to return 200 because they only check is_shutdown(), which remains False. The pod stays in Running state indefinitely — accepting new requests but never processing them.

Approach

PyExecutor (engine-level):

  • Three-tier error classification via standalone error_classification.py module (no CUDA/C++ dependencies):
    • Immediate-fatal (device-side assert, illegal address/memory access, launch failure) — crash on first occurrence
    • Severe (CUDA OOM, NCCL) — costs 5× budget per error
    • Transient — costs 1× budget per error
  • Token-bucket error budget (starts at 1.0, recovers at 0.1/s of error-free time). When exhausted, the error is promoted to fatal.
  • On fatal error: set is_shutdown = True immediately, fail all active requests (via a list copy to avoid aliased-list bug), and enqueue a shutdown request.
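The classification tiers and token-bucket budget described above can be sketched as follows. This is a minimal illustration built only from the PR description: the pattern lists, the class name `ErrorBudget`, and the exact cost/recovery constants (0.1 transient, 0.5 severe, 0.1/s recovery, 1e-9 epsilon) are assumptions drawn from the text, not the actual contents of `error_classification.py`.

```python
import time

# Hypothetical pattern lists modeled on the tiers described in this PR.
IMMEDIATE_FATAL_PATTERNS = (
    "device-side assert",
    "illegal memory access",
    "illegal address",
    "launch failure",
)
SEVERE_PATTERNS = ("out of memory", "nccl")


def classify_error(msg: str) -> str:
    """Return 'immediate_fatal', 'severe', or 'transient' for an error message."""
    lowered = msg.lower()
    if any(p in lowered for p in IMMEDIATE_FATAL_PATTERNS):
        return "immediate_fatal"
    if any(p in lowered for p in SEVERE_PATTERNS):
        return "severe"
    return "transient"


class ErrorBudget:
    """Token bucket: starts at 1.0 and recovers 0.1 per error-free second.

    Transient errors cost 0.1 and severe errors cost 0.5 (5x), so two severe
    errors back-to-back exhaust the budget while brief transient blips do not.
    """

    def __init__(self, now=None):
        self.budget = 1.0
        self.last_error_time = time.monotonic() if now is None else now

    def consume(self, severity: str, now=None) -> bool:
        """Charge the budget; return True when the error must be fatal."""
        if severity == "immediate_fatal":
            return True  # bypass the budget entirely
        if now is None:
            now = time.monotonic()
        # Recover budget for error-free time, capped at the starting value.
        self.budget = min(1.0, self.budget + 0.1 * (now - self.last_error_time))
        self.last_error_time = now
        self.budget -= 0.5 if severity == "severe" else 0.1
        # Epsilon comparison avoids IEEE 754 rounding issues near zero.
        return self.budget < 1e-9
```

Note how the epsilon threshold matters: ten transient errors leave the float budget at roughly 1e-16 rather than exactly 0.0, so a strict `<= 0` check could misbehave.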

GenerationExecutor (base class):

  • _fatal_error field + _set_fatal_error() (first-error-wins).
  • check_health() drains the error queue directly via get_nowait() (not via _handle_background_error() which is documented for main-thread use). Per-request errors (RequestError, str) are skipped.
  • _handle_background_error() queue drain path also filters RequestError/str to avoid poisoning _fatal_error from per-request failures.
  • is_shutdown() also returns True when fatal error is set.
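The drain-and-filter behavior above can be sketched like this. The names are illustrative (the real logic is a method on GenerationExecutor operating on its own queue and state); only the shape — non-blocking `get_nowait()` drain, skip per-request errors, latch the first fatal one — is taken from the description.

```python
import queue


class RequestError(RuntimeError):
    """Stand-in for the per-request error type; illustrative only."""


def check_health(error_queue, set_fatal_error, shutting_down):
    """Drain the error queue without blocking.

    Per-request errors (RequestError, str) are skipped so a single bad
    request cannot poison the process-wide fatal state. Returns True when
    healthy, False when a fatal error was found or shutdown is underway.
    """
    while True:
        try:
            err = error_queue.get_nowait()
        except queue.Empty:
            break
        if isinstance(err, (RequestError, str)):
            continue  # request-scoped failure: not a process-level problem
        set_fatal_error(err)
        return False
    return not shutting_down
```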

GenerationExecutorProxy (MPI multi-process):

  • check_health() inspects MPI worker futures for unexpected exits. Calls pre_shutdown() (not shutdown(), which would block on f.result() for surviving workers).
  • _error_monitor_loop daemon thread polls every ~5s, drains error queue directly, skips per-request errors. Also uses pre_shutdown().
  • pre_shutdown() sentinel condition fixed: all() changed to any() so surviving workers still get the quit signal when one has already died.
  • Self-join guard: shutdown() skips _error_monitor_thread.join() when called from the monitor thread itself.
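The self-join guard can be illustrated in isolation. This is a sketch with invented names, not the real proxy: it only demonstrates why `shutdown()` must skip `join()` when it runs on the monitor thread itself (in CPython, `Thread.join()` on the current thread raises RuntimeError).

```python
import threading


class MonitorOwnerSketch:
    """Illustrative only: shows the self-join guard, not the real proxy."""

    def __init__(self):
        self._stop = threading.Event()
        self._monitor = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        # The real loop polls worker futures / the error queue every ~5 s
        # and may itself call shutdown() when it detects a fatal condition.
        self._stop.wait()

    def start(self):
        self._monitor.start()

    def shutdown(self):
        self._stop.set()
        # Joining the current thread raises RuntimeError, so skip the join
        # when shutdown() is invoked from the monitor thread itself.
        if threading.current_thread() is not self._monitor:
            self._monitor.join()
```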

Serving layer:

  • GrpcRequestManager.health_check uses check_health() and surfaces a sanitized fatal error summary ("TypeName: first line") to clients; full error logged server-side.
  • OpenAIServer /health raises SIGINT on fatal error to terminate the process.
  • BaseLLM._check_health() delegates to executor.check_health().
  • Existing test_health updated to patch check_health() instead of is_shutdown().
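The sanitized "TypeName: first line" summary can be sketched as below. The helper name is hypothetical; the PR only specifies the output shape, with the full traceback logged server-side rather than sent to clients.

```python
def summarize_fatal_error(error: BaseException) -> str:
    """Build a client-safe one-line summary; never include the traceback."""
    lines = str(error).splitlines()
    first_line = lines[0] if lines else ""
    return f"{type(error).__name__}: {first_line}"
```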

Test coverage

| Component | Tests | How |
| --- | --- | --- |
| classify_error() | 15 (parametrized) | Tests the real module-level function directly (imported from error_classification.py via importlib without loading C++ extensions) |
| Token-bucket error budget | 9 | Immediate-fatal bypass, severe exhaustion in 2, transient exhaustion, time-based recovery, aliased-list fix verification, is_shutdown set on fatal |
| GenerationExecutor | 10 | _set_fatal_error first-wins, is_shutdown (4 states), check_health (4 states incl. error queue drain with per-request skip) |
| GenerationExecutorProxy | 6 | MPI worker future states (running/crashed/exited/cancelled), parent-unhealthy short-circuit |
| _error_monitor_loop | 4 | Worker crash detection, error queue detection, per-request string error skip, shutdown flag stop |
| gRPC health check | 5 | Parametrized: healthy, fatal, no executor, no LLM, shutdown |
| OpenAI /health | 3 | Parametrized: 200, 503, 503+SIGINT |
| BaseLLM._check_health | 4 | Parametrized delegation: healthy, fatal, no executor, missing attr |

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chienchunhung chienchunhung force-pushed the fix-zombie-worker-health-check branch from f32efd0 to 1a7c2d4 Compare April 3, 2026 00:03
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #41528 [ run ] triggered by Bot. Commit: 985bd0f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41528 [ run ] completed with state SUCCESS. Commit: 985bd0f
/LLM/main/L0_MergeRequest_PR pipeline #32444 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


When the C++ executor crashes (e.g. CUDA OOM), the Python process
previously stayed alive with health checks returning 200, creating
zombie pods that accept connections but never produce tokens.

This adds defense-in-depth:
1. GenerationExecutor.check_health() drains error queue and checks
   fatal error state instead of only checking doing_shutdown
2. GenerationExecutorProxy checks MPI worker future liveness and
   runs a background error monitor thread for auto-shutdown
3. PyExecutor detects CUDA fatal errors and consecutive error loops,
   initiating shutdown instead of silently failing each batch
4. Health endpoint raises SIGINT on fatal error to terminate the
   process (same pattern as existing CppExecutorError handlers)

NVBug 6043291

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
- Join _error_monitor_thread during proxy shutdown to prevent
  pytest-threadleak detection of the daemon thread
- Update test_health to patch check_health() instead of is_shutdown()
  since BaseLLM._check_health() now delegates to executor.check_health()

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
@chienchunhung chienchunhung force-pushed the fix-zombie-worker-health-check branch from c89534a to 6fb3968 Compare April 3, 2026 17:06
…udget

- Fix self-join deadlock: skip _error_monitor_thread.join() when
  shutdown() runs on the monitor thread itself
- Replace _handle_background_error() call in monitor loop with direct
  queue drain to avoid re-entrancy (documented for main-thread use)
- Replace simple consecutive error counter with token-bucket scheme:
  budget starts at 1.0, recovers at 0.1/s of error-free time, each
  transient error costs 0.1 and each severe error costs 0.5
- Split error classification into three tiers:
  - immediate_fatal: device-side assert, illegal address, launch failure
    — crash on first occurrence
  - severe: CUDA OOM, NCCL error — drains budget 5x faster
  - transient: all other errors — tolerated until budget exhausted
- Use epsilon comparison (< 1e-9) to avoid IEEE 754 rounding issues

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
- Add Args/Returns docstrings to _classify_error, _is_fatal_error,
  _consume_error_budget, _handle_errors, check_health, _set_fatal_error
- Add type annotations to class attributes and return types
- Add inline comment block explaining the token-bucket fields
- Move time import to module level in proxy.py
- Parametrize tests: consolidate is_shutdown, check_health, proxy worker
  states, gRPC health, OpenAI health, and BaseLLM delegation into
  pytest.mark.parametrize for better coverage with fewer test methods
- Promote _make_request to module-level helper to remove duplication

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
@chienchunhung chienchunhung force-pushed the fix-zombie-worker-health-check branch from 3e5a9b1 to 5442515 Compare April 3, 2026 18:21
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #41722 [ run ] triggered by Bot. Commit: 5442515 Link to invocation

@chienchunhung chienchunhung marked this pull request as ready for review April 3, 2026 18:28
@chienchunhung chienchunhung requested review from a team as code owners April 3, 2026 18:28
@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Adds deterministic error classification and a token-bucket error budget to PyExecutor, records the first fatal background error at executor level, introduces a proxy monitor thread that surfaces fatal conditions, and propagates fatal-status checks through gRPC/OpenAI/LLM health endpoints to trigger coordinated shutdown.

Changes

  • PyExecutor error handling (tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/error_classification.py): Added classify_error() module and integrated classification into PyExecutor. Introduced fatal-state tracking (_fatal_error) and a token-bucket error budget (_error_budget, recovery rate, cost); refactored _handle_errors() to consume budget, mark fatal, fail active requests, and enqueue shutdown requests.
  • Executor core health (tensorrt_llm/executor/executor.py): Added _fatal_error field and _set_fatal_error() to record the first unrecoverable error. is_shutdown() now reflects fatal state; added check_health() to non-blockingly inspect queued background errors and trigger fatal+shutdown when appropriate.
  • Executor proxy & monitor (tensorrt_llm/executor/proxy.py): Added background _error_monitor_thread running _error_monitor_loop() to poll mpi_futures and drain _error_queue, triggering _set_fatal_error() and pre_shutdown() on fatal events. Added check_health() to inspect futures and errors; updated shutdown() to join the monitor thread.
  • Service-layer integration (tensorrt_llm/grpc/grpc_request_manager.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/serve/openai_server.py): Health endpoints now call executor check_health(). gRPC health returns explicit fatal-error messages when executor _fatal_error exists. OpenAIServer logs fatal executor errors and raises SIGINT to initiate process shutdown when fatal is detected.
  • Tests (tests/unittest/executor/test_fatal_error_health_check.py, tests/unittest/llmapi/apps/_test_openai_metrics.py): Added extensive tests covering error classification, budget consumption and recovery, fatal-error propagation, _handle_errors() behavior, proxy monitor thread behavior, health-check flows (gRPC/OpenAI/BaseLLM), and updated a health negative-case to use check_health().

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant OpenAI as "OpenAI Server"
    participant LLM as "BaseLLM"
    participant Executor as "GenerationExecutor"
    participant ProxyMonitor as "Proxy Monitor Thread"
    participant PyExec as "PyExecutor"

    Client->>OpenAI: GET /health
    OpenAI->>LLM: _check_health()
    LLM->>Executor: check_health()
    Executor->>PyExec: inspect error queue / fatal state

    ProxyMonitor->>Executor: periodic poll (mpi_futures / error_queue)
    alt fatal condition detected
        ProxyMonitor->>Executor: _set_fatal_error(err)
        ProxyMonitor->>Executor: pre_shutdown() / shutdown()
        Executor->>PyExec: enqueue_shutdown_request()
    end

    Executor-->>LLM: check_health() = False (if fatal)
    LLM-->>OpenAI: unhealthy
    alt fatal present
        OpenAI->>OpenAI: signal.raise_signal(SIGINT)
    else
        OpenAI-->>Client: 503 LLM unavailable
    end
sequenceDiagram
    participant PyExec as "PyExecutor"
    participant Classifier as "classify_error()"
    participant Budget as "Error Budget"
    participant ReqQueue as "Executor Request Queue"

    PyExec->>Classifier: _classify_error(error_msg)
    Classifier-->>PyExec: immediate_fatal / severe / transient

    alt immediate_fatal
        PyExec->>PyExec: set _fatal_error immediately
    else severe or transient
        PyExec->>Budget: _consume_error_budget(severity)
        Budget-->>PyExec: budget_exhausted? (after recovery/deduction)
    end

    alt budget exhausted or immediate fatal
        PyExec->>ReqQueue: enqueue_shutdown_request()
        PyExec->>PyExec: mark is_shutdown, fail/complete active requests
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 23.53%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title clearly and specifically summarizes the main change: adding fatal error detection to prevent zombie worker pods, directly addressing the core problem described in the PR.
  • Description check (✅ Passed): PR description is comprehensive and well-structured, covering problem statement, technical approach, test coverage, and checklist completion.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

3496-3523: ⚠️ Potential issue | 🔴 Critical

Fatal errors need to stop local scheduling immediately.

After setting _fatal_error, Line 3522 only enqueues a shutdown request. This class's loop guards still key off self.is_shutdown, not _fatal_error (Lines 858-860), so requests already sitting in waiting_queue — or ahead of the shutdown sentinel in executor_request_queue — can still be activated on the next iteration and run after a fatal CUDA error. Please flip the local shutdown state and fail/drain pending queued work in the same fatal path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3496 - 3523,
After setting self._fatal_error in the fatal-error path, immediately set the
local shutdown flag (self.is_shutdown = True) and fail/drain any pending work in
waiting_queue (and any requests still in executor_request_queue if applicable)
similar to how active_requests are handled: mark each queued request
state=LlmRequestState.GENERATION_COMPLETE, populate error_responses for their
py_request_id (use the same error_msg and client id logic used for
active_requests), call self._enqueue_responses on those items, and call
self._terminate_request for each; only after draining waiting_queue and
owner-level queues call executor_request_queue.enqueue_shutdown_request() so no
queued requests get scheduled after a fatal CUDA error.

3502-3519: ⚠️ Potential issue | 🔴 Critical

Copy active_requests before clearing them.

When requests is None, Line 3502 aliases failed_requests to self.active_requests. Line 3511 then clears that list before the termination loop, so Lines 3518-3519 never free resources for the failed requests. Broad failures from _forward_step(), _sample_async(), _update_requests(), etc. will therefore leak KV/sequence state.

Suggested fix
-        failed_requests = requests if requests is not None else self.active_requests
+        failed_requests = list(requests) if requests is not None else list(
+            self.active_requests
+        )
         for request in failed_requests:
             req_id = request.py_request_id
             request.state = LlmRequestState.GENERATION_COMPLETE
             error_responses[req_id] = LlmResponse(
                 request_id=req_id,
@@
         if requests is None:
             self.active_requests.clear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3502 - 3519, The
loop currently aliases failed_requests to self.active_requests when requests is
None, then clears self.active_requests before calling _terminate_request,
causing resource leaks; change the assignment so that failed_requests is a
shallow copy (e.g., failed_requests = list(self.active_requests)) when requests
is None, keep the existing branch that filters self.active_requests when
requests is provided, and ensure the subsequent calls to _enqueue_responses and
_terminate_request iterate over that copied failed_requests so
_terminate_request(request) runs for each failed request (reference
variables/methods: failed_requests, self.active_requests, _enqueue_responses,
_terminate_request).
🧹 Nitpick comments (1)
tensorrt_llm/executor/executor.py (1)

318-320: Keep fatal crashes distinguishable from graceful shutdown.

Now that is_shutdown() returns True for both cases, callers like BaseLLM.generate_async() collapse fatal engine failures into the generic "LLM is shutting down" error. Consider exposing the latched fatal error separately so request paths can surface the real cause while still using is_shutdown() for control flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/executor/executor.py` around lines 318 - 320, is_shutdown()
currently conflates graceful shutdown and fatal crashes by returning True
whenever self._fatal_error is set; add a separate accessor (e.g., a read-only
property or method named fatal_error or get_fatal_error) that returns the
latched exception (self._fatal_error) while leaving is_shutdown() unchanged, and
update call sites such as BaseLLM.generate_async to first check
executor.fatal_error and surface/raise that exception if present before treating
is_shutdown() as a generic shutdown case. Ensure the accessor returns None when
there is no fatal error and do not change the existing boolean semantics of
is_shutdown().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3473-3495: The helper _handle_errors currently always calls
self._consume_error_budget and charges the process-wide fatal budget even for
request-local failures; change _handle_errors(signature) to accept an optional
flag (e.g., charge_process_budget: bool = True) and only call
self._consume_error_budget(error_msg) when charge_process_budget is True (leave
behavior unchanged by default). Update all call sites that represent
request-scoped/user-controlled failures (the validation error path that calls
_handle_errors from the validation code, the KV-transfer timeout caller, and the
guided-decoder failure caller) to pass charge_process_budget=False so those do
not decrement the shared budget; keep fatal-shutdown callers using the default
True. Ensure references to symbols: _handle_errors, _consume_error_budget, and
active_requests are preserved and that tests cover both charged and uncharged
failure paths.
- Around line 3377-3411: The _classify_error logic currently treats "illegal
memory access" as only a severe error because _IMMEDIATE_FATAL_PATTERNS lacks
that text, so add patterns like "illegal memory access" and "illegal address"
(or a single substring that matches the common CUDA/PyTorch phrasing) to
_IMMEDIATE_FATAL_PATTERNS and remove/avoid duplicating it in
_SEVERE_ERROR_PATTERNS; update the classification unit tests to assert that an
error message containing "CUDA error: an illegal memory access was encountered"
(and similar variants) returns "immediate_fatal" from _classify_error,
referencing the _classify_error method and the
_IMMEDIATE_FATAL_PATTERNS/_SEVERE_ERROR_PATTERNS lists.

In `@tensorrt_llm/executor/executor.py`:
- Around line 104-106: The TOCTOU race in Executor._set_fatal_error() must be
fixed by making the check-and-set atomic with a lock: add a threading.Lock
(e.g., self._fatal_error_lock) in the Executor.__init__ and wrap the body of
_set_fatal_error() with that lock so the "if self._fatal_error is None:
self._fatal_error = error" sequence is performed under the lock; this ensures
calls from _error_monitor_loop() and _handle_background_error() cannot overwrite
the first error and preserves the documented first-error-wins behavior.

In `@tensorrt_llm/executor/proxy.py`:
- Around line 148-157: The health check path that iterates self.mpi_futures (in
check_health) must not call shutdown() inline when a future is done; instead,
when a done() future is detected capture its exception (using f.exception() if
not f.cancelled()), call self._set_fatal_error(error), and trigger the
non-blocking pre-shutdown path (call the pre_shutdown mechanism or schedule it)
without waiting on other futures or calling shutdown() synchronously; ensure
doing_shutdown is set/checked to avoid double-triggering and avoid letting the
queued worker exception be re-raised by the synchronous shutdown path so
check_health() can reliably return False.

In `@tensorrt_llm/grpc/grpc_request_manager.py`:
- Around line 176-180: The health-check branch in grpc_request_manager.py
currently returns self.llm._executor._fatal_error verbatim (which may include
full tracebacks via CppExecutorError.__str__), so change the return to a
sanitized summary: build a short string like "<ErrorType>: <short message>"
using type(self.llm._executor._fatal_error).__name__ and a trimmed message
(e.g., str(self.llm._executor._fatal_error).splitlines()[0]) and return that to
gRPC clients; keep logging the full error/traceback to your logger (e.g.,
process or executor logger) for diagnostics but do not include the stack in
HealthCheckResponse.status. Ensure this change is applied where the code checks
self.llm._executor.check_health() and references
self.llm._executor._fatal_error.

---

Outside diff comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3496-3523: After setting self._fatal_error in the fatal-error
path, immediately set the local shutdown flag (self.is_shutdown = True) and
fail/drain any pending work in waiting_queue (and any requests still in
executor_request_queue if applicable) similar to how active_requests are
handled: mark each queued request state=LlmRequestState.GENERATION_COMPLETE,
populate error_responses for their py_request_id (use the same error_msg and
client id logic used for active_requests), call self._enqueue_responses on those
items, and call self._terminate_request for each; only after draining
waiting_queue and owner-level queues call
executor_request_queue.enqueue_shutdown_request() so no queued requests get
scheduled after a fatal CUDA error.
- Around line 3502-3519: The loop currently aliases failed_requests to
self.active_requests when requests is None, then clears self.active_requests
before calling _terminate_request, causing resource leaks; change the assignment
so that failed_requests is a shallow copy (e.g., failed_requests =
list(self.active_requests)) when requests is None, keep the existing branch that
filters self.active_requests when requests is provided, and ensure the
subsequent calls to _enqueue_responses and _terminate_request iterate over that
copied failed_requests so _terminate_request(request) runs for each failed
request (reference variables/methods: failed_requests, self.active_requests,
_enqueue_responses, _terminate_request).

---

Nitpick comments:
In `@tensorrt_llm/executor/executor.py`:
- Around line 318-320: is_shutdown() currently conflates graceful shutdown and
fatal crashes by returning True whenever self._fatal_error is set; add a
separate accessor (e.g., a read-only property or method named fatal_error or
get_fatal_error) that returns the latched exception (self._fatal_error) while
leaving is_shutdown() unchanged, and update call sites such as
BaseLLM.generate_async to first check executor.fatal_error and surface/raise
that exception if present before treating is_shutdown() as a generic shutdown
case. Ensure the accessor returns None when there is no fatal error and do not
change the existing boolean semantics of is_shutdown().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e74ae06f-199f-4f3d-9fd7-3375887e229f

📥 Commits

Reviewing files that changed from the base of the PR and between 3ac2704 and 5442515.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/grpc/grpc_request_manager.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/serve/openai_server.py
  • tests/unittest/executor/test_fatal_error_health_check.py
  • tests/unittest/llmapi/apps/_test_openai_metrics.py

Critical fixes:
- Fix aliased list bug: _handle_errors() iterated self.active_requests
  then called clear(), so _terminate_request never ran on the fatal
  path.  Now copies the list before clearing.
- Set is_shutdown=True immediately on fatal error to prevent the
  executor loop from scheduling more requests on a corrupted context.

Robustness improvements:
- Monitor loop and base check_health() now skip per-request errors
  (RequestError, str) from the error queue instead of promoting them
  to fatal — a single bad request should not crash the server.
- Base check_health() drains the error queue directly instead of
  calling _handle_background_error() (avoids re-entrancy/thread issues
  since health checks run from the event loop thread, not main).

Testability:
- Extract classify_error() and pattern constants into standalone
  error_classification.py module (no CUDA/C++ dependencies).
- Tests import and test the real classify_error() function directly.
- Fix pattern: "cudaerrorillegaladd" -> "cudaerrorillegaladdress".

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
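The aliased-list bug described in the commit above can be demonstrated in isolation. This is a toy example, not the executor's code: it only shows why iterating an alias of a list that gets cleared first silently skips every element, and why a shallow copy fixes it.

```python
# Buggy shape: failed_requests aliases active_requests, and clear() runs
# before the termination loop, so the loop sees an empty list.
active_requests = ["req-1", "req-2"]
failed_requests = active_requests        # alias, not a copy
active_requests.clear()
terminated = [r for r in failed_requests]
assert terminated == []                  # nothing terminated: resources leak

# Fixed shape: copy before clearing, so the termination loop still runs.
active_requests = ["req-1", "req-2"]
failed_requests = list(active_requests)  # shallow copy
active_requests.clear()
terminated = [r for r in failed_requests]
assert terminated == ["req-1", "req-2"]
```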
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
- Add "illegal memory access" to immediate-fatal patterns so
  "CUDA error: an illegal memory access" is not misclassified as
  severe (matched "cuda error" first)
- Sanitize _fatal_error in gRPC health response: return
  "TypeName: first line" instead of full traceback to clients;
  log full error to server logger for diagnostics
- Fix test mock ConcreteExecutor.check_health() to skip str errors
  (matching production behavior)
- Fix docstring: "below zero" -> "below a near-zero threshold"

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
@chienchunhung
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chienchunhung
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #41730 [ run ] triggered by Bot. Commit: e5e25c1 Link to invocation

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

3458-3487: ⚠️ Potential issue | 🟠 Major

Fatal shutdown still leaves queued work runnable.

Line 3460 sets is_shutdown, but this path only clears active_requests. should_stop_processing at Lines 859-861 still waits for waiting_queue to empty, and enqueue_shutdown_request() in tensorrt_llm/_torch/pyexecutor/executor_request_queue.py, Lines 133-136 only appends a shutdown item. Already-buffered requests can therefore be promoted and scheduled after the fatal error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3458 - 3487, The
fatal-shutdown path sets self._fatal_error and clears active_requests but leaves
already-buffered work in the executor_request_queue.waiting_queue eligible for
promotion; to fix this, when is_fatal is true (inside the fatal branch around
self._fatal_error/self.is_shutdown) also drain or invalidate the
executor_request_queue.waiting_queue so no pending requests can be promoted:
iterate over executor_request_queue.waiting_queue, mark each as failed (create
error LlmResponse entries and call _terminate_request as done for
active_requests), then clear waiting_queue (or add a new
executor_request_queue.clear_waiting_queue() and call it) before calling
enqueue_shutdown_request(); alternatively modify enqueue_shutdown_request() to
atomically flush/invalidate waiting_queue when a fatal flag is present (check
self._fatal_error) so no buffered requests are scheduled after fatal.
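Under the assumption that `waiting_queue` is a simple FIFO, the `clear_waiting_queue()` helper proposed above could be sketched as follows (the helper name comes from the prompt; the real `ExecutorRequestQueue` carries more state):

```python
from collections import deque

class ExecutorRequestQueue:
    """Simplified stand-in for the real request queue."""

    def __init__(self):
        self.waiting_queue = deque()

    def clear_waiting_queue(self) -> list:
        # Swap in a fresh deque so nothing buffered can be promoted
        # after a fatal error; the caller fails each drained request.
        drained, self.waiting_queue = list(self.waiting_queue), deque()
        return drained
```

On the fatal path, the executor would call this before `enqueue_shutdown_request()` and emit an error response for every drained request.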
♻️ Duplicate comments (3)
tensorrt_llm/executor/executor.py (1)

304-316: ⚠️ Potential issue | 🔴 Critical

Make _set_fatal_error() actually first-error-wins.

Lines 314-316 are still a check-then-set race. _error_monitor_loop() runs on a background thread, while _handle_background_error() and check_health() can call the same setter from request/health paths, so the second writer can overwrite the first fatal cause.

#!/bin/bash
set -euo pipefail

echo "Fatal setter call sites:"
rg -n -C2 '\b_set_fatal_error\s*\(' tensorrt_llm/executor

echo
echo "Threaded callers and thread creation:"
rg -n -C3 '_error_monitor_loop|threading\.Thread|check_health\(|_handle_background_error\(' tensorrt_llm/executor
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/executor/executor.py` around lines 304 - 316, The current
_set_fatal_error(self, error) does a racy check-then-set; make it truly
first-error-wins by protecting the check-and-assign with a mutex (e.g., add a
threading.Lock instance on the Executor and acquire it inside _set_fatal_error)
or by using an atomic compare-and-swap pattern; specifically, create a lock like
self._fatal_lock in the Executor initializer, then in _set_fatal_error() do with
self._fatal_lock: if self._fatal_error is None: self._fatal_error = error;
logger.error(...). Update only _set_fatal_error (and the Executor init to add
the lock) so calls from _error_monitor_loop, _handle_background_error, and
check_health all respect the same mutual exclusion and the first writer wins.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

3435-3457: ⚠️ Potential issue | 🟠 Major

Don't charge request-local failures against the process-wide fatal budget.

Line 3456 still burns the shared bucket for every _handle_errors() call, but _fetch_and_activate_new_requests() (Line 2747), _handle_responses() (Lines 3648-3650), and _handle_guided_decoder_errors() (Line 3851) use this helper for request-scoped failures. A burst of bad requests can still flip the whole worker fatal.

Possible direction
 def _handle_errors(self,
                    error_msg: Optional[str] = None,
                    *,
-                   requests: Optional[List[LlmRequest]] = None) -> None:
+                   requests: Optional[List[LlmRequest]] = None,
+                   charge_process_budget: bool = True) -> None:
@@
-        is_fatal = self._consume_error_budget(error_msg)
+        is_fatal = (
+            self._consume_error_budget(error_msg)
+            if charge_process_budget else False
+        )

Then pass charge_process_budget=False from the request-local call sites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3435 - 3457, The
_handle_errors helper currently always calls
self._consume_error_budget(error_msg) which charges the shared process-wide
fatal budget even for request-local failures; change _handle_errors to accept an
optional flag (e.g., charge_process_budget: bool = True) and only call
self._consume_error_budget(error_msg) when that flag is True, leaving
request-local failures uncharged; then update the request-scoped call sites
_fetch_and_activate_new_requests, _handle_responses, and
_handle_guided_decoder_errors to pass charge_process_budget=False when invoking
_handle_errors so only true process-level faults consume the shared budget.
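A sketch of that flag, with a deliberately simplified counter in place of the real token-bucket budget (names follow the prompt; the bodies are illustrative, not the actual implementation):

```python
from typing import List, Optional

class PyExecutor:
    """Stand-in showing only the error-budget plumbing."""

    def __init__(self, budget: int = 3):
        self._error_tokens = budget  # simplified token bucket
        self._fatal_error: Optional[BaseException] = None

    def _consume_error_budget(self, error_msg: Optional[str]) -> bool:
        self._error_tokens -= 1
        return self._error_tokens <= 0  # True once the bucket is drained

    def _handle_errors(self,
                       error_msg: Optional[str] = None,
                       *,
                       requests: Optional[List] = None,
                       charge_process_budget: bool = True) -> None:
        # Request-local failures (bad prompts, guided-decoder errors)
        # no longer drain the shared process-wide budget.
        is_fatal = (self._consume_error_budget(error_msg)
                    if charge_process_budget else False)
        if is_fatal:
            self._fatal_error = RuntimeError(error_msg or "fatal error")
```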
tensorrt_llm/executor/proxy.py (1)

149-157: ⚠️ Potential issue | 🔴 Critical

Don't run full shutdown() inline from the worker-crash health paths.

Lines 154-156 and 188-189 call shutdown() as soon as one MPI future is done. But pre_shutdown() still only sends the quit sentinel when all futures are alive (Line 396), so surviving workers never get notified and shutdown() blocks on the remaining f.result() calls. This path can also re-raise out of shutdown() instead of reliably reporting unhealthy.

Safer direction
-                    if not self.doing_shutdown:
-                        self.shutdown()
+                    if not self.doing_shutdown:
+                        self.pre_shutdown()
                     return False
...
-                            if not self.doing_shutdown:
-                                self.shutdown()
+                            if not self.doing_shutdown:
+                                self.pre_shutdown()
                             return
-        if all(not f.done() for f in self.mpi_futures):
+        if any(not f.done() for f in self.mpi_futures):
             self.request_queue.put_noblock(None, retry=4)
#!/bin/bash
set -euo pipefail

echo "Health/monitor paths that call shutdown:"
rg -n -C2 'f\.done\(\)|self\.shutdown\(\)' tensorrt_llm/executor/proxy.py

echo
echo "pre_shutdown sentinel condition:"
sed -n '383,399p' tensorrt_llm/executor/proxy.py

echo
echo "shutdown wait loop:"
sed -n '399,421p' tensorrt_llm/executor/proxy.py

Also applies to: 178-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/executor/proxy.py` around lines 149 - 157, The health-check loop
that inspects self.mpi_futures must not call self.shutdown() inline because
shutdown() waits for remaining futures and blocks when survivors never receive
the quit sentinel; instead, when a future is done set the fatal error via
self._set_fatal_error(exc_or_runtimeerror), mark the intent to shutdown by
toggling self.doing_shutdown = True (if not already), and return False so the
higher-level monitor/cleanup path can perform an orderly shutdown (which will
call pre_shutdown() and send sentinels correctly). Update both places that call
self.shutdown() from the mpi_futures health checks (the loop using for f in
self.mpi_futures and the other health-check block at lines ~178-190) to follow
this pattern and avoid re-raising from within shutdown().
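The non-blocking pattern the prompt describes (record the cause, flag shutdown intent, report unhealthy) reduces to something like the sketch below; `set_fatal` and `request_shutdown` are stand-ins for `_set_fatal_error` and `pre_shutdown`:

```python
from concurrent.futures import Future
from typing import Callable, List

def check_worker_futures(futures: List[Future],
                         set_fatal: Callable,
                         request_shutdown: Callable) -> bool:
    # Worker futures only complete when a worker exits, so any done
    # future means a crash. Record it and flag shutdown intent
    # instead of blocking inside a full shutdown().
    for f in futures:
        if f.done():
            exc = f.exception() or RuntimeError("MPI worker exited early")
            set_fatal(exc)
            request_shutdown()
            return False
    return True
```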
🧹 Nitpick comments (2)
tests/unittest/executor/test_fatal_error_health_check.py (2)

126-134: Consider catching specific exception instead of bare Exception.

The get_nowait() call can only raise queue.Empty. While this mock mirrors production code patterns, using a specific exception makes the test mock's behavior clearer.

♻️ Suggested fix
+from queue import Queue, Empty
 ...
             try:
                 e = self._error_queue.get_nowait()
                 self._error_queue.task_done()
                 if not isinstance(e, str):
                     self._set_fatal_error(e)
                     self.doing_shutdown = True
-            except Exception:
+            except Empty:
                 pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/executor/test_fatal_error_health_check.py` around lines 126 -
134, Replace the bare except in the block that calls
self._error_queue.get_nowait() with a specific catch for queue.Empty (or Empty
if imported) so only the expected no-item case is swallowed; update imports if
necessary to reference queue.Empty, leaving the logic that sets
_set_fatal_error(e) and toggles doing_shutdown unchanged.

403-455: Thread timing in tests could be sensitive to CI load.

The monitor loop tests use fixed sleep times (time.sleep(0.05), time.sleep(0.2)) for thread synchronization. While these work in normal conditions, they could be flaky in slow CI environments. The 2-second join timeouts provide good safety margins.

If flakiness is observed, consider using synchronization primitives (e.g., threading.Event) instead of fixed sleeps for more deterministic behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/executor/test_fatal_error_health_check.py` around lines 403 -
455, Replace fragile time.sleep calls with threading.Event-based
synchronization: create an Event (e.g., started = threading.Event()) and start
the monitor thread with a wrapper target that sets started.set() then calls
executor._error_monitor_loop(); replace time.sleep(0.05) with
started.wait(timeout=1.0) to ensure the monitor is running; instead of
time.sleep(0.2) wait up to a short timeout in a loop for the expected condition
(e.g., wait for executor._fatal_error to become non-None or for
executor.doing_shutdown/_shutdown_called to change) so tests like
test_detects_worker_crash, test_detects_error_queue_item, and
test_skips_per_request_string_errors use deterministic events and condition
waits while still joining the thread afterwards.
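Sketched as a self-contained test, with a toy monitor standing in for `_error_monitor_loop` (the deadline loop replaces the fixed `time.sleep(0.2)`):

```python
import threading
import time

def test_monitor_detects_crash_deterministically():
    started = threading.Event()
    state = {"fatal": None}

    def monitor_loop():
        started.set()  # signal that the monitor is running
        state["fatal"] = RuntimeError("worker crashed")

    t = threading.Thread(target=monitor_loop, daemon=True)
    t.start()
    # Deterministic handoff instead of time.sleep(0.05).
    assert started.wait(timeout=1.0)
    # Poll for the expected condition under a deadline
    # instead of a single fixed sleep.
    deadline = time.monotonic() + 2.0
    while state["fatal"] is None and time.monotonic() < deadline:
        time.sleep(0.01)
    t.join(timeout=2.0)
    assert state["fatal"] is not None
```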
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 996f84b2-399b-4653-8d3d-4755442fb2cf

📥 Commits

Reviewing files that changed from the base of the PR and between 5442515 and e5e25c1.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/pyexecutor/error_classification.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/grpc/grpc_request_manager.py
  • tests/unittest/executor/test_fatal_error_health_check.py

@chienchunhung chienchunhung requested a review from a team as a code owner April 3, 2026 21:21
@chienchunhung chienchunhung requested a review from QiJune April 3, 2026 21:21
@chienchunhung chienchunhung force-pushed the fix-zombie-worker-health-check branch from 026c8de to 5d12cea Compare April 3, 2026 21:24
@chienchunhung
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
tests/unittest/executor/test_fatal_error_health_check.py (1)

122-135: Mock's check_health() doesn't fully mirror production RequestError filtering.

The production check_health() in executor.py filters both str and RequestError as per-request errors (line 352: if not isinstance(e, (str, RequestError))), but this mock only checks for str (line 129). Consider adding RequestError filtering to match production behavior more closely, or document this simplification.

Suggested fix
+# Import or define a mock RequestError for test accuracy
+class MockRequestError(Exception):
+    pass
+
 class ConcreteExecutor:
     ...
     def check_health(self) -> bool:
         if self.doing_shutdown or self._fatal_error is not None:
             return False
         if not self._error_queue.empty():
             try:
                 e = self._error_queue.get_nowait()
                 self._error_queue.task_done()
-                if not isinstance(e, str):
+                if not isinstance(e, (str, MockRequestError)):
                     self._set_fatal_error(e)
                     self.doing_shutdown = True
             except Empty:
                 pass
             return self._fatal_error is None and not self.doing_shutdown
         return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/executor/test_fatal_error_health_check.py` around lines 122 -
135, The mock's check_health method currently treats only str as per-request
errors; update it to mirror production by treating both str and RequestError as
non-fatal per-request errors: when pulling an item from _error_queue in
check_health, check isinstance(e, (str, RequestError)) and only call
_set_fatal_error and set doing_shutdown when the exception is not one of those;
adjust imports or references so RequestError type is available to the test/mock.
tensorrt_llm/executor/executor.py (1)

348-357: Narrow the exception handler in check_health() to avoid silently swallowing unexpected errors.

The except Exception: pass block is too broad. It's intended to catch queue.Empty from get_nowait(), but will also silently swallow any other unexpected exception (e.g., if _set_fatal_error or shutdown raises). Consider catching Empty specifically.

Suggested fix
+from queue import Empty
 ...
         if not self._error_queue.empty():
             try:
                 e = self._error_queue.get_nowait()
                 self._error_queue.task_done()
                 if not isinstance(e, (str, RequestError)):
                     self._set_fatal_error(e)
                     self.shutdown()
-            except Exception:
+            except Empty:
                 pass
             return self._fatal_error is None and not self.doing_shutdown
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/executor/executor.py` around lines 348 - 357, In check_health(),
narrow the broad "except Exception: pass" around self._error_queue.get_nowait()
to only catch the queue.Empty exception (or queue.Empty and any specific
expected exceptions) so you don't silently swallow errors from _set_fatal_error
or shutdown; update the except clause to "except queue.Empty:" (and add the
appropriate import/reference to queue.Empty) and let other exceptions propagate
so they can be handled/logged by callers.
tensorrt_llm/executor/proxy.py (1)

198-215: Consider narrowing inner exception handler to Empty.

The inner except Exception: pass at lines 210-211 is intended to catch queue.Empty from get_nowait(), but will also swallow unexpected errors. The outer handler at lines 214-215 is appropriate for monitor resilience.

Suggested fix
                 if not self._error_queue.empty():
                     try:
                         e = self._error_queue.get_nowait()
                         self._error_queue.task_done()
                         if isinstance(e, (str, RequestError)):
                             continue
                         self._set_fatal_error(e)
                         logger.error(
                             "Error monitor: background error detected: "
                             f"{e!r}")
                         if not self.doing_shutdown:
                             self.pre_shutdown()
-                    except Exception:
+                    except Empty:
                         pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/executor/proxy.py` around lines 198 - 215, The inner exception
handler around self._error_queue.get_nowait() is too broad and should only catch
queue.Empty to avoid swallowing real errors; change the inner except Exception
to except queue.Empty (or import Empty and use except Empty) so only the
expected empty-queue case is ignored, while leaving the outer broad except
Exception intact for monitor resilience; keep references to
self._error_queue.get_nowait(), self._error_queue.task_done(),
self._set_fatal_error(), self.pre_shutdown(), self._fatal_error and
self.doing_shutdown when applying the change.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 69cb9bdd-c2c7-449a-9293-ea9a19f51a06

📥 Commits

Reviewing files that changed from the base of the PR and between e5e25c1 and 5d12cea.

📒 Files selected for processing (3)
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/proxy.py
  • tests/unittest/executor/test_fatal_error_health_check.py

- _handle_background_error() queue drain now filters RequestError/str
  (mirrors check_health/monitor behavior, prevents per-request errors
  from poisoning _fatal_error)
- Use pre_shutdown() instead of shutdown() in proxy check_health()
  and _error_monitor_loop() to avoid blocking on f.result() for
  surviving workers
- Fix pre_shutdown sentinel: all() -> any() so surviving workers
  get the quit signal even when one has already died
- Test mocks: except Exception -> except Empty, shutdown -> pre_shutdown

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
@chienchunhung chienchunhung force-pushed the fix-zombie-worker-health-check branch from 5d12cea to 97a81ef Compare April 3, 2026 21:53
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #41750 [ run ] triggered by Bot. Commit: 97a81ef Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41750 [ run ] completed with state SUCCESS. Commit: 97a81ef
/LLM/main/L0_MergeRequest_PR pipeline #32649 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
