
fix: catch OSError in config_file_lock for NFS compatibility#11960

Open
sara4dev wants to merge 1 commit into NVIDIA:main from sara4dev:fix/config-file-lock-nfs-oserror

Conversation


@sara4dev sara4dev commented Mar 5, 2026

Summary

config_file_lock() in tensorrt_llm/_torch/model_config.py crashes when HF_MODULES_CACHE resides on an NFS-mounted filesystem. On NFS, filelock operations can raise OSError with errno ENOLCK (No locks available) or ESTALE (Stale file handle) instead of PermissionError. Since the current exception handler only catches PermissionError and filelock.Timeout, these NFS-specific errors bypass the existing tempdir fallback and crash the process.

This is particularly impactful in multi-node GPU clusters where a shared NFS cache is standard practice — every pod that tries to load a model config concurrently hits this crash.

Changes

Add OSError to both exception handlers in config_file_lock():

  1. Primary lock attempt (line 51): `except (PermissionError, filelock.Timeout)` → `except (PermissionError, OSError, filelock.Timeout)` — triggers the tempdir fallback for NFS errors
  2. Tempdir fallback (line 66): `except (PermissionError)` → `except (PermissionError, OSError)` — handles the unlikely case where the tempdir lock also fails

Since PermissionError is a subclass of OSError, catching OSError technically covers both, but keeping PermissionError explicit preserves the original intent and readability.
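The subclass relationship can be checked directly; a quick sanity check (not code from this PR):

```python
# PermissionError is one of the OSError subclasses introduced by PEP 3151,
# so a handler written for OSError also catches PermissionError.
print(issubclass(PermissionError, OSError))  # True

try:
    raise PermissionError("simulated EACCES")
except OSError as exc:
    # Caught via the OSError base class; the concrete type is preserved.
    print(type(exc).__name__)  # PermissionError
```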

Root Cause

NFSv3 uses the Network Lock Manager (NLM) protocol for file locking, which is unreliable for cross-node flock()/fcntl() operations. When filelock.FileLock attempts to acquire a lock on an NFS path:

  • Lock acquisition can fail with OSError: [Errno 37] No locks available (ENOLCK)
  • Lock release can fail with OSError: [Errno 116] Stale file handle (ESTALE)

The existing fallback to /tmp (local ephemeral storage) is the correct behavior for this case — it just wasn't being triggered.
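A stricter variant would trigger the fallback only for the two NFS-specific errno values rather than every OSError. A hypothetical sketch (the helper names `_is_nfs_lock_error` and `classify` are illustrative, not code in this PR):

```python
import errno

# Hypothetical helper: narrow the fallback trigger to the two
# NFS-specific errno values instead of catching all of OSError.
def _is_nfs_lock_error(exc: OSError) -> bool:
    # ENOLCK: "No locks available" (errno 37 on Linux)
    # ESTALE: "Stale file handle"  (errno 116 on Linux)
    return exc.errno in (errno.ENOLCK, errno.ESTALE)

def classify(exc: OSError) -> str:
    if isinstance(exc, PermissionError) or _is_nfs_lock_error(exc):
        return "fallback"  # switch to the tempdir-based lock
    raise exc  # unrelated OSErrors (e.g. ENOSPC) should still surface

print(classify(OSError(errno.ENOLCK, "No locks available")))  # fallback
```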

Reproduction

# On any NFS-mounted path, run from multiple nodes concurrently:
import filelock
lock = filelock.FileLock("/nfs-mount/test.lock")
with lock:  # Raises OSError: [Errno 37] No locks available
    pass

Workaround (for users on affected versions)

Set HF_MODULES_CACHE=/tmp/hf_modules as an environment variable to redirect the lock file to local storage.
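For example, in a shell session or pod spec (the path is the one suggested above):

```shell
# Redirect the HF modules cache (and its lock file) to local ephemeral storage.
export HF_MODULES_CACHE=/tmp/hf_modules
mkdir -p "$HF_MODULES_CACHE"
echo "$HF_MODULES_CACHE"
```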

Fixes #11958

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced file locking behavior to catch additional OS and permission-related errors. Added temporary directory-based fallback locking with timeout handling.

@sara4dev sara4dev requested a review from a team as a code owner March 5, 2026 20:18
@coderabbitai
Contributor

coderabbitai Bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

Expanded exception handling in the config_file_lock() function to catch OSError alongside existing PermissionError and filelock.Timeout exceptions. Added a fallback lock mechanism using a temporary directory-based lock file with timeout handling and logging.

Changes

Cohort: Lock Error Handling Improvements
File(s): tensorrt_llm/_torch/model_config.py
Summary: Extended exception handling to catch OSError (NFS lock failures). Added fallback mechanism: on primary lock failure, create and use a temporary _remote_code.lock file via FileLock. Implemented timeout handling with a warning log if the tempdir lock cannot be acquired. Updated comments to document OS/permission issues and NFS locking failures as fallback triggers.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check ✅ Passed: The title clearly and concisely describes the main change: adding OSError handling to config_file_lock for NFS compatibility. It matches the core objective of the PR.
  • Description check ✅ Passed: The description follows the template structure with Summary, Changes, Root Cause, Reproduction, and Workaround sections. It provides sufficient context about the issue, the NFS-specific problem, and the solution.
  • Linked Issues check ✅ Passed: The PR modifications directly address issue #11958 by catching OSError in both the primary lock attempt and tempdir fallback path, enabling the existing tempdir fallback for NFS-specific errors as required.
  • Out of Scope Changes check ✅ Passed: All changes in tensorrt_llm/_torch/model_config.py are directly related to fixing the NFS compatibility issue in config_file_lock. No out-of-scope modifications detected.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/model_config.py`:
- Around line 51-54: The contextmanager config_file_lock currently contains
multiple yield points and broad try/excepts that can catch exceptions raised by
the with-block; refactor it to a single-yield pattern by performing lock
acquisition and any related filesystem error handling before yielding, then
yield exactly once to the caller, and perform lock release/cleanup in a finally
block after the yield; ensure only the lock acquisition logic (e.g.,
filelock.FileLock.acquire and fallback to tempfile logic) is wrapped in
try/except for PermissionError/OSError/filelock.Timeout, and remove any
try/except that spans the yield so exceptions from the with-block bubble up
normally.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 46c0c039-a585-4aad-aff3-13f66709b223

📥 Commits

Reviewing files that changed from the base of the PR and between 4786834 and 72a8665.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/model_config.py

Comment on lines +51 to +54
except (PermissionError, OSError, filelock.Timeout):
# Fallback to tempdir when primary lock path is unusable (e.g.,
# NFS locking failures like ENOLCK/ESTALE, permission issues,
# or lock acquisition timeouts)
Contributor

@coderabbitai coderabbitai Bot Mar 5, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, examine the file to see the actual code at the flagged lines
head -n 85 tensorrt_llm/_torch/model_config.py | tail -n 45

Repository: NVIDIA/TensorRT-LLM

Length of output: 1887


🏁 Script executed:

# Run the embedded verification script to check for violations
python - <<'PY'
import ast
from pathlib import Path

p = Path("tensorrt_llm/_torch/model_config.py")
tree = ast.parse(p.read_text())

fn = next((n for n in tree.body if isinstance(n, ast.FunctionDef) and n.name == "config_file_lock"), None)
if fn:
    yields = [n for n in ast.walk(fn) if isinstance(n, ast.Yield)]
    print("yield_count:", len(yields))

    violations = []
    for t in [n for n in ast.walk(fn) if isinstance(n, ast.Try)]:
        has_yield = any(isinstance(x, ast.Yield) for x in ast.walk(t))
        catches_oserror = any(
            (h.type is not None and "OSError" in ast.unparse(h.type))
            for h in t.handlers
        )
        if has_yield and catches_oserror:
            violations.append((t.lineno, [ast.unparse(h.type) if h.type else "bare" for h in t.handlers]))

    print("try_yield_catches_oserror violations:", violations)
else:
    print("config_file_lock function not found")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 230


🏁 Script executed:

# Verify the exact line numbers match what was flagged
cat -n tensorrt_llm/_torch/model_config.py | sed -n '40,75p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 1677


🏁 Script executed:

# Double-check the contextmanager behavior with Python semantics
python3 - <<'PY'
import contextlib

# Demonstrate the problematic pattern
@contextlib.contextmanager
def bad_context():
    print("Setup")
    try:
        yield
    except ValueError:
        print("Caught ValueError, yielding again")
        yield  # This is problematic!

# Test what happens
try:
    with bad_context():
        raise ValueError("from caller")
except Exception as e:
    print(f"Final exception: {type(e).__name__}: {e}")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 169


Refactor to single-yield contextmanager; separate lock acquisition from yield scope.

This contextmanager has multiple yield statements (4 total) with try-except handlers catching OSError across yield points. When caller code inside `with config_file_lock():` raises an OSError, it gets caught and re-yielded, which triggers `RuntimeError: generator didn't stop after throw()` and masks the original exception.

Move lock acquisition (and its error handling) before the yield. The yield body should have no try-except around it. This follows coding guidelines: "limit the except to the smallest set of errors possible" and "keep the try body small."

Refactor to single-yield contextmanager (lock acquire/release separated from yield scope)
+import errno
 import contextlib
@@
 @contextlib.contextmanager
 def config_file_lock(timeout: int = 10):
@@
-    # Create and acquire the lock
-    lock = filelock.FileLock(str(lock_path), timeout=timeout)
-
-    try:
-        with lock:
-            yield
-    except (PermissionError, OSError, filelock.Timeout):
-        # Fallback to tempdir when primary lock path is unusable (e.g.,
-        # NFS locking failures like ENOLCK/ESTALE, permission issues,
-        # or lock acquisition timeouts)
-        tmp_dir = Path(tempfile.gettempdir())
-        tmp_dir.mkdir(parents=True, exist_ok=True)
-        tmp_lock_path = tmp_dir / "_remote_code.lock"
-        tmp_lock = filelock.FileLock(str(tmp_lock_path), timeout=timeout)
-        try:
-            with tmp_lock:
-                yield
-        except filelock.Timeout:
-            logger.warning(
-                f"failed to acquire tempdir config lock within {timeout} seconds, proceeding without lock"
-            )
-            # proceed without lock
-            yield
-        except (PermissionError, OSError) as e:
-            logger.warning(
-                f"tempdir config lock unavailable due to OS/permission issue: {e}, proceeding without lock"
-            )
-            # proceed without lock
-            yield
+    def _is_nfs_lock_error(exc: OSError) -> bool:
+        return exc.errno in (errno.ENOLCK, errno.ESTALE)
+
+    def _try_acquire(lock: filelock.FileLock) -> bool:
+        try:
+            lock.acquire(timeout=timeout)
+            return True
+        except (PermissionError, filelock.Timeout):
+            return False
+        except OSError as exc:
+            if _is_nfs_lock_error(exc):
+                return False
+            raise
+
+    lock = filelock.FileLock(str(lock_path), timeout=timeout)
+    active_lock: filelock.FileLock | None = None
+
+    if _try_acquire(lock):
+        active_lock = lock
+    else:
+        logger.warning("Primary config lock unavailable; falling back to tempdir lock")
+        tmp_dir = Path(tempfile.gettempdir())
+        tmp_dir.mkdir(parents=True, exist_ok=True)
+        tmp_lock = filelock.FileLock(str(tmp_dir / "_remote_code.lock"), timeout=timeout)
+        if _try_acquire(tmp_lock):
+            active_lock = tmp_lock
+        else:
+            logger.warning(
+                f"Failed to acquire tempdir config lock within {timeout} seconds, proceeding without lock"
+            )
+
+    try:
+        yield
+    finally:
+        if active_lock is not None:
+            try:
+                active_lock.release()
+            except (PermissionError, filelock.Timeout, OSError) as exc:
+                logger.warning(f"Failed to release config lock cleanly: {exc}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/model_config.py` around lines 51 - 54, The contextmanager
config_file_lock currently contains multiple yield points and broad try/excepts
that can catch exceptions raised by the with-block; refactor it to a
single-yield pattern by performing lock acquisition and any related filesystem
error handling before yielding, then yield exactly once to the caller, and
perform lock release/cleanup in a finally block after the yield; ensure only the
lock acquisition logic (e.g., filelock.FileLock.acquire and fallback to tempfile
logic) is wrapped in try/except for PermissionError/OSError/filelock.Timeout,
and remove any try/except that spans the yield so exceptions from the with-block
bubble up normally.

Collaborator


We might need to clean this up in the follow-up PR.

Contributor


@chienchunhung, understood — the refactor to a single-yield contextmanager (separating lock acquisition from the yield scope) can be done in a follow-up PR. The current change still improves NFS compatibility, so it's a good incremental step.

Would you like me to open a GitHub issue to track the cleanup of config_file_lock (refactor to single-yield pattern with lock acquire/release separated from the yield scope)?


🧠 Learnings used
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 11508
File: tests/unittest/_torch/sampler/test_beam_search_util.py:71-71
Timestamp: 2026-02-13T10:15:37.120Z
Learning: In TensorRT-LLM (Python requires >=3.10 and <4 as per setup.py), you can use Python 3.10+ features (e.g., PEP 585 generics like dict[str, int], list[str], etc.) throughout the codebase, and you do not need to add from __future__ import annotations. This applies to all Python files, including tests (e.g., tests/unittest/...); ensure tests and code consistently rely on Python 3.10+ features where applicable.

Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 11983
File: tensorrt_llm/_torch/pyexecutor/llm_request.py:237-240
Timestamp: 2026-03-06T11:45:02.068Z
Learning: In tensorrt_llm/_torch/pyexecutor/llm_request.py, LogProbStorage.append() relies on next(iter(prob.values())).logprob to accumulate cum_log_probs when cum_log_probs is None. This path assumes prob is a non-empty dict because TorchSampler places the sampled logprob first when num_logprobs >= 0. Therefore, no guard for empty dicts is needed here. If future changes may yield empty prob, consider adding a guard or a clearer invariant.

Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 12009
File: tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py:296-299
Timestamp: 2026-03-09T12:34:56.631Z
Learning: In tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py, BasicAVTransformerBlock should not be flagged for a config mismatch when config.parallel.dit_ulysses_size > 1. The function setup_sequence_parallelism() returns use_ulysses=True for dit_ulysses_size > 1, or raises a RuntimeError/ValueError/NotImplementedError; it never returns use_ulysses=False in that case. Treat this as intentional and correct; do not flag as a mismatch between raw config checks and setup_sequence_parallelism()'s result.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 5, 2026
@pengbowang-nv
Collaborator

Hi @chang-l could you also take a look at this one? Thanks!

Also to @sara4dev maybe we should check for errno as OSError is a broad one. In addition, you may need to finish DCO before merge (see github failed check for details).

@chienchunhung
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38459 [ run ] triggered by Bot. Commit: 72a8665 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38459 [ run ] completed with state ABORTED. Commit: 72a8665
LLM/main/L0_MergeRequest_PR #29815 (Blue Ocean) completed with status: ABORTED

Link to invocation

@chienchunhung
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38492 [ run ] triggered by Bot. Commit: 72a8665 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38492 [ run ] completed with state SUCCESS. Commit: 72a8665
/LLM/main/L0_MergeRequest_PR pipeline #29842 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@chienchunhung chienchunhung left a comment


The PR looks good to me. PS: We might need to clean up something in the follow-up PR.

@chienchunhung
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38606 [ run ] triggered by Bot. Commit: 72a8665 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38606 [ run ] completed with state SUCCESS. Commit: 72a8665
/LLM/main/L0_MergeRequest_PR pipeline #29942 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

On NFS-backed HF_MODULES_CACHE paths, filelock operations can raise
OSError with errno ENOLCK (No locks available) or ESTALE (Stale file
handle) instead of PermissionError. This causes config_file_lock() to
crash rather than falling back to the tempdir-based lock.

Add OSError to the exception handlers so the existing fallback logic
handles NFS locking failures gracefully.

Fixes NVIDIA#11958
@pengbowang-nv pengbowang-nv force-pushed the fix/config-file-lock-nfs-oserror branch from 72a8665 to e548b35 on March 23, 2026 at 06:54

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

config_file_lock() fails with OSError: [Errno 37] No locks available on NFS-backed HF cache

6 participants