Skip to content

[None][fix] Fix GIL race in hostfunc nanobind bindings causing intermittent segfault#12957

Open
ssam18 wants to merge 1 commit intoNVIDIA:mainfrom
ssam18:fix/hostfunc-nanobind-gil-race
Open

[None][fix] Fix GIL race in hostfunc nanobind bindings causing intermittent segfault#12957
ssam18 wants to merge 1 commit intoNVIDIA:mainfrom
ssam18:fix/hostfunc-nanobind-gil-race

Conversation

@ssam18
Copy link
Copy Markdown
Contributor

@ssam18 ssam18 commented Apr 12, 2026

The launch_hostfunc and free_hostfunc_user_data bindings had a redundant call_guard<gil_scoped_release> that opened a window between argument conversion and the function body where another thread could free the Python callable being passed in. Both functions already manage the GIL internally, so the outer call guard was doing nothing useful while creating the race. Removing it closes the window and fixes the intermittent segfault seen in disaggregated serving with Eagle3 (fixes #12694).

Summary by CodeRabbit

  • Performance Improvements
    • Optimized host function execution in the TensorRT-LLM runtime to improve concurrency and reduce synchronization overhead during CUDA stream operations.

…revent GIL race

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1ace6f21-4609-463a-b27a-e0a5acc9860f

📥 Commits

Reviewing files that changed from the base of the PR and between 9f9feb9 and 00b96e1.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp

📝 Walkthrough

Walkthrough

This change removes explicit GIL (Global Interpreter Lock) handling from nanobind wrapper bindings in hostfunc.cpp. It deletes nb::gil_scoped_acquire calls from the C++ wrapper functions and removes nb::call_guard<nb::gil_scoped_release>() from binding exports, while preserving GIL management where Python code is directly invoked.

Changes

Cohort / File(s) Summary
GIL Handling Refactor
cpp/tensorrt_llm/nanobind/runtime/hostfunc.cpp
Removed nb::gil_scoped_acquire from launchHostFunc and freeHostFuncUserData functions. Removed nb::call_guard<nb::gil_scoped_release>() from m.def() bindings for both launch_hostfunc and free_hostfunc_user_data exports while retaining GIL acquisition in the cudaHostFuncTrampoline where Python code executes. Copyright year updated.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: fixing a GIL race condition in hostfunc nanobind bindings that causes intermittent segfaults.
Description check ✅ Passed The PR description explains the issue and solution clearly, but is missing sections from the template such as Test Coverage and the PR Checklist.
Linked Issues check ✅ Passed The code changes directly address issue #12694 by removing the redundant call_guard that created a GIL race window, which matches the objective to fix the nanobind-related segfault in disaggregated serving.
Out of Scope Changes check ✅ Passed All changes are in-scope: removing redundant GIL call guards from hostfunc bindings to fix the race condition, plus a copyright year update.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: A segfault related to nanobind in disaggregated mode

2 participants