
fix: catch OSError in config_file_lock for NFS compatibility#11960

Open
sara4dev wants to merge 1 commit into NVIDIA:main from sara4dev:fix/config-file-lock-nfs-oserror

Conversation


@sara4dev sara4dev commented Mar 5, 2026

Summary

config_file_lock() in tensorrt_llm/_torch/model_config.py crashes when HF_MODULES_CACHE resides on an NFS-mounted filesystem. On NFS, filelock operations can raise OSError with errno ENOLCK (No locks available) or ESTALE (Stale file handle) instead of PermissionError. Since the current exception handler only catches PermissionError and filelock.Timeout, these NFS-specific errors bypass the existing tempdir fallback and crash the process.

This is particularly impactful in multi-node GPU clusters where a shared NFS cache is standard practice — every pod that tries to load a model config concurrently hits this crash.

Changes

Add OSError to both exception handlers in config_file_lock():

  1. Primary lock attempt (line 51): `except (PermissionError, filelock.Timeout)` → `except (PermissionError, OSError, filelock.Timeout)` — triggers the tempdir fallback for NFS errors
  2. Tempdir fallback (line 66): `except (PermissionError)` → `except (PermissionError, OSError)` — handles the unlikely case where the tempdir lock also fails

Since PermissionError is a subclass of OSError, catching OSError technically covers both, but keeping PermissionError explicit preserves the original intent and readability.
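The subclass relationship can be checked directly; a quick sanity check (not code from this PR):

```python
# PermissionError is one of the OSError subclasses introduced by PEP 3151,
# so a handler written for OSError also catches PermissionError.
print(issubclass(PermissionError, OSError))  # True

try:
    raise PermissionError("simulated EACCES")
except OSError as exc:
    # Caught via the OSError base class; the concrete type is preserved.
    print(type(exc).__name__)  # PermissionError
```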

Root Cause

NFSv3 uses the Network Lock Manager (NLM) protocol for file locking, which is unreliable for cross-node flock()/fcntl() operations. When filelock.FileLock attempts to acquire a lock on an NFS path:

  • Lock acquisition can fail with OSError: [Errno 37] No locks available (ENOLCK)
  • Lock release can fail with OSError: [Errno 116] Stale file handle (ESTALE)

The existing fallback to /tmp (local ephemeral storage) is the correct behavior for this case — it just wasn't being triggered.
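A stricter variant would trigger the fallback only for the two NFS-specific errno values rather than every OSError. A hypothetical sketch (the helper names `_is_nfs_lock_error` and `classify` are illustrative, not code in this PR):

```python
import errno

# Hypothetical helper: narrow the fallback trigger to the two
# NFS-specific errno values instead of catching all of OSError.
def _is_nfs_lock_error(exc: OSError) -> bool:
    # ENOLCK: "No locks available" (errno 37 on Linux)
    # ESTALE: "Stale file handle"  (errno 116 on Linux)
    return exc.errno in (errno.ENOLCK, errno.ESTALE)

def classify(exc: OSError) -> str:
    if isinstance(exc, PermissionError) or _is_nfs_lock_error(exc):
        return "fallback"  # switch to the tempdir-based lock
    raise exc  # unrelated OSErrors (e.g. ENOSPC) should still surface

print(classify(OSError(errno.ENOLCK, "No locks available")))  # fallback
```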

Reproduction

# On any NFS-mounted path, run from multiple nodes concurrently:
import filelock
lock = filelock.FileLock("/nfs-mount/test.lock")
with lock:  # Raises OSError: [Errno 37] No locks available
    pass

Workaround (for users on affected versions)

Set HF_MODULES_CACHE=/tmp/hf_modules as an environment variable to redirect the lock file to local storage.
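For example, in a shell session or pod spec (the path is the one suggested above):

```shell
# Redirect the HF modules cache (and its lock file) to local ephemeral storage.
export HF_MODULES_CACHE=/tmp/hf_modules
mkdir -p "$HF_MODULES_CACHE"
echo "$HF_MODULES_CACHE"
```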

Fixes #11958

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced file locking behavior to catch additional OS and permission-related errors. Added temporary directory-based fallback locking with timeout handling.

@sara4dev sara4dev requested a review from a team as a code owner March 5, 2026 20:18
@coderabbitai
Contributor

coderabbitai Bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

Expanded exception handling in the config_file_lock() function to catch OSError alongside existing PermissionError and filelock.Timeout exceptions. Added a fallback lock mechanism using a temporary directory-based lock file with timeout handling and logging.

Changes

Cohort: Lock Error Handling Improvements
File(s): tensorrt_llm/_torch/model_config.py
Summary: Extended exception handling to catch OSError (NFS lock failures). Added fallback mechanism: on primary lock failure, create and use a temporary _remote_code.lock file via FileLock. Implemented timeout handling with a warning log if the tempdir lock cannot be acquired. Updated comments to document OS/permission issues and NFS locking failures as fallback triggers.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check ✅ Passed: The title clearly and concisely describes the main change: adding OSError handling to config_file_lock for NFS compatibility. It matches the core objective of the PR.
  • Description check ✅ Passed: The description follows the template structure with Summary, Changes, Root Cause, Reproduction, and Workaround sections. It provides sufficient context about the issue, the NFS-specific problem, and the solution.
  • Linked Issues check ✅ Passed: The PR modifications directly address issue #11958 by catching OSError in both the primary lock attempt and tempdir fallback path, enabling the existing tempdir fallback for NFS-specific errors as required.
  • Out of Scope Changes check ✅ Passed: All changes in tensorrt_llm/_torch/model_config.py are directly related to fixing the NFS compatibility issue in config_file_lock. No out-of-scope modifications detected.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/model_config.py`:
- Around line 51-54: The contextmanager config_file_lock currently contains
multiple yield points and broad try/excepts that can catch exceptions raised by
the with-block; refactor it to a single-yield pattern by performing lock
acquisition and any related filesystem error handling before yielding, then
yield exactly once to the caller, and perform lock release/cleanup in a finally
block after the yield; ensure only the lock acquisition logic (e.g.,
filelock.FileLock.acquire and fallback to tempfile logic) is wrapped in
try/except for PermissionError/OSError/filelock.Timeout, and remove any
try/except that spans the yield so exceptions from the with-block bubble up
normally.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 46c0c039-a585-4aad-aff3-13f66709b223

📥 Commits

Reviewing files that changed from the base of the PR and between 4786834 and 72a8665.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/model_config.py

Comment on lines +51 to +54
except (PermissionError, OSError, filelock.Timeout):
# Fallback to tempdir when primary lock path is unusable (e.g.,
# NFS locking failures like ENOLCK/ESTALE, permission issues,
# or lock acquisition timeouts)
Contributor

@coderabbitai coderabbitai Bot Mar 5, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, examine the file to see the actual code at the flagged lines
head -n 85 tensorrt_llm/_torch/model_config.py | tail -n 45

Repository: NVIDIA/TensorRT-LLM

Length of output: 1887


🏁 Script executed:

# Run the embedded verification script to check for violations
python - <<'PY'
import ast
from pathlib import Path

p = Path("tensorrt_llm/_torch/model_config.py")
tree = ast.parse(p.read_text())

fn = next((n for n in tree.body if isinstance(n, ast.FunctionDef) and n.name == "config_file_lock"), None)
if fn:
    yields = [n for n in ast.walk(fn) if isinstance(n, ast.Yield)]
    print("yield_count:", len(yields))

    violations = []
    for t in [n for n in ast.walk(fn) if isinstance(n, ast.Try)]:
        has_yield = any(isinstance(x, ast.Yield) for x in ast.walk(t))
        catches_oserror = any(
            (h.type is not None and "OSError" in ast.unparse(h.type))
            for h in t.handlers
        )
        if has_yield and catches_oserror:
            violations.append((t.lineno, [ast.unparse(h.type) if h.type else "bare" for h in t.handlers]))

    print("try_yield_catches_oserror violations:", violations)
else:
    print("config_file_lock function not found")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 230


🏁 Script executed:

# Verify the exact line numbers match what was flagged
cat -n tensorrt_llm/_torch/model_config.py | sed -n '40,75p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 1677


🏁 Script executed:

# Double-check the contextmanager behavior with Python semantics
python3 - <<'PY'
import contextlib

# Demonstrate the problematic pattern
@contextlib.contextmanager
def bad_context():
    print("Setup")
    try:
        yield
    except ValueError:
        print("Caught ValueError, yielding again")
        yield  # This is problematic!

# Test what happens
try:
    with bad_context():
        raise ValueError("from caller")
except Exception as e:
    print(f"Final exception: {type(e).__name__}: {e}")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 169


Refactor to single-yield contextmanager; separate lock acquisition from yield scope.

This contextmanager has multiple yield statements (4 total) with try-except handlers catching OSError across yield points. When caller code inside `with config_file_lock():` raises an OSError, it gets caught and re-yielded, which triggers `RuntimeError: generator didn't stop after throw()` and masks the original exception.

Move lock acquisition (and its error handling) before the yield. The yield body should have no try-except around it. This follows coding guidelines: "limit the except to the smallest set of errors possible" and "keep the try body small."

Refactor to single-yield contextmanager (lock acquire/release separated from yield scope)
+import errno
 import contextlib
@@
 @contextlib.contextmanager
 def config_file_lock(timeout: int = 10):
@@
-    # Create and acquire the lock
-    lock = filelock.FileLock(str(lock_path), timeout=timeout)
-
-    try:
-        with lock:
-            yield
-    except (PermissionError, OSError, filelock.Timeout):
-        # Fallback to tempdir when primary lock path is unusable (e.g.,
-        # NFS locking failures like ENOLCK/ESTALE, permission issues,
-        # or lock acquisition timeouts)
-        tmp_dir = Path(tempfile.gettempdir())
-        tmp_dir.mkdir(parents=True, exist_ok=True)
-        tmp_lock_path = tmp_dir / "_remote_code.lock"
-        tmp_lock = filelock.FileLock(str(tmp_lock_path), timeout=timeout)
-        try:
-            with tmp_lock:
-                yield
-        except filelock.Timeout:
-            logger.warning(
-                f"failed to acquire tempdir config lock within {timeout} seconds, proceeding without lock"
-            )
-            # proceed without lock
-            yield
-        except (PermissionError, OSError) as e:
-            logger.warning(
-                f"tempdir config lock unavailable due to OS/permission issue: {e}, proceeding without lock"
-            )
-            # proceed without lock
-            yield
+    def _is_nfs_lock_error(exc: OSError) -> bool:
+        return exc.errno in (errno.ENOLCK, errno.ESTALE)
+
+    def _try_acquire(lock: filelock.FileLock) -> bool:
+        try:
+            lock.acquire(timeout=timeout)
+            return True
+        except (PermissionError, filelock.Timeout):
+            return False
+        except OSError as exc:
+            if _is_nfs_lock_error(exc):
+                return False
+            raise
+
+    lock = filelock.FileLock(str(lock_path), timeout=timeout)
+    active_lock: filelock.FileLock | None = None
+
+    if _try_acquire(lock):
+        active_lock = lock
+    else:
+        logger.warning("Primary config lock unavailable; falling back to tempdir lock")
+        tmp_dir = Path(tempfile.gettempdir())
+        tmp_dir.mkdir(parents=True, exist_ok=True)
+        tmp_lock = filelock.FileLock(str(tmp_dir / "_remote_code.lock"), timeout=timeout)
+        if _try_acquire(tmp_lock):
+            active_lock = tmp_lock
+        else:
+            logger.warning(
+                f"Failed to acquire tempdir config lock within {timeout} seconds, proceeding without lock"
+            )
+
+    try:
+        yield
+    finally:
+        if active_lock is not None:
+            try:
+                active_lock.release()
+            except (PermissionError, filelock.Timeout, OSError) as exc:
+                logger.warning(f"Failed to release config lock cleanly: {exc}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/model_config.py` around lines 51 - 54, The contextmanager
config_file_lock currently contains multiple yield points and broad try/excepts
that can catch exceptions raised by the with-block; refactor it to a
single-yield pattern by performing lock acquisition and any related filesystem
error handling before yielding, then yield exactly once to the caller, and
perform lock release/cleanup in a finally block after the yield; ensure only the
lock acquisition logic (e.g., filelock.FileLock.acquire and fallback to tempfile
logic) is wrapped in try/except for PermissionError/OSError/filelock.Timeout,
and remove any try/except that spans the yield so exceptions from the with-block
bubble up normally.

Collaborator


We might need to clean this up in the follow-up PR.

Contributor


@chienchunhung, understood — the refactor to a single-yield contextmanager (separating lock acquisition from the yield scope) can be done in a follow-up PR. The current change still improves NFS compatibility, so it's a good incremental step.

Would you like me to open a GitHub issue to track the cleanup of config_file_lock (refactor to single-yield pattern with lock acquire/release separated from the yield scope)?


🧠 Learnings used
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 11508
File: tests/unittest/_torch/sampler/test_beam_search_util.py:71-71
Timestamp: 2026-02-13T10:15:37.120Z
Learning: In TensorRT-LLM (Python requires >=3.10 and <4 as per setup.py), you can use Python 3.10+ features (e.g., PEP 585 generics like dict[str, int], list[str], etc.) throughout the codebase, and you do not need to add from __future__ import annotations. This applies to all Python files, including tests (e.g., tests/unittest/...); ensure tests and code consistently rely on Python 3.10+ features where applicable.

Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 11983
File: tensorrt_llm/_torch/pyexecutor/llm_request.py:237-240
Timestamp: 2026-03-06T11:45:02.068Z
Learning: In tensorrt_llm/_torch/pyexecutor/llm_request.py, LogProbStorage.append() relies on next(iter(prob.values())).logprob to accumulate cum_log_probs when cum_log_probs is None. This path assumes prob is a non-empty dict because TorchSampler places the sampled logprob first when num_logprobs >= 0. Therefore, no guard for empty dicts is needed here. If future changes may yield empty prob, consider adding a guard or a clearer invariant.

Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 12009
File: tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py:296-299
Timestamp: 2026-03-09T12:34:56.631Z
Learning: In tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py, BasicAVTransformerBlock should not be flagged for a config mismatch when config.parallel.dit_ulysses_size > 1. The function setup_sequence_parallelism() returns use_ulysses=True for dit_ulysses_size > 1, or raises a RuntimeError/ValueError/NotImplementedError; it never returns use_ulysses=False in that case. Treat this as intentional and correct; do not flag as a mismatch between raw config checks and setup_sequence_parallelism()'s result.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 5, 2026
@pengbowang-nv
Collaborator

Hi @chang-l could you also take a look at this one? Thanks!

Also to @sara4dev maybe we should check for errno as OSError is a broad one. In addition, you may need to finish DCO before merge (see github failed check for details).

@chienchunhung
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38459 [ run ] triggered by Bot. Commit: 72a8665 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38459 [ run ] completed with state ABORTED. Commit: 72a8665
LLM/main/L0_MergeRequest_PR #29815 (Blue Ocean) completed with status: ABORTED

Link to invocation

@chienchunhung
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38492 [ run ] triggered by Bot. Commit: 72a8665 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38492 [ run ] completed with state SUCCESS. Commit: 72a8665
/LLM/main/L0_MergeRequest_PR pipeline #29842 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@chienchunhung chienchunhung left a comment


The PR looks good to me. PS: We might need to clean up something in the follow-up PR.

@chienchunhung
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38606 [ run ] triggered by Bot. Commit: 72a8665 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38606 [ run ] completed with state SUCCESS. Commit: 72a8665
/LLM/main/L0_MergeRequest_PR pipeline #29942 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

On NFS-backed HF_MODULES_CACHE paths, filelock operations can raise
OSError with errno ENOLCK (No locks available) or ESTALE (Stale file
handle) instead of PermissionError. This causes config_file_lock() to
crash rather than falling back to the tempdir-based lock.

Add OSError to the exception handlers so the existing fallback logic
handles NFS locking failures gracefully.

Fixes NVIDIA#11958
@pengbowang-nv pengbowang-nv force-pushed the fix/config-file-lock-nfs-oserror branch from 72a8665 to e548b35 on March 23, 2026 at 06:54

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

config_file_lock() fails with OSError: [Errno 37] No locks available on NFS-backed HF cache

6 participants