fix: catch OSError in config_file_lock for NFS compatibility#11960
fix: catch OSError in config_file_lock for NFS compatibility#11960sara4dev wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
📝 WalkthroughWalkthroughExpanded exception handling in the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/model_config.py`:
- Around line 51-54: The contextmanager config_file_lock currently contains
multiple yield points and broad try/excepts that can catch exceptions raised by
the with-block; refactor it to a single-yield pattern by performing lock
acquisition and any related filesystem error handling before yielding, then
yield exactly once to the caller, and perform lock release/cleanup in a finally
block after the yield; ensure only the lock acquisition logic (e.g.,
filelock.FileLock.acquire and fallback to tempfile logic) is wrapped in
try/except for PermissionError/OSError/filelock.Timeout, and remove any
try/except that spans the yield so exceptions from the with-block bubble up
normally.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 46c0c039-a585-4aad-aff3-13f66709b223
📒 Files selected for processing (1)
tensorrt_llm/_torch/model_config.py
| except (PermissionError, OSError, filelock.Timeout): | ||
| # Fallback to tempdir when primary lock path is unusable (e.g., | ||
| # NFS locking failures like ENOLCK/ESTALE, permission issues, | ||
| # or lock acquisition timeouts) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, examine the file to see the actual code at the flagged lines
head -n 85 tensorrt_llm/_torch/model_config.py | tail -n 45Repository: NVIDIA/TensorRT-LLM
Length of output: 1887
🏁 Script executed:
# Run the embedded verification script to check for violations
python - <<'PY'
import ast
from pathlib import Path
p = Path("tensorrt_llm/_torch/model_config.py")
tree = ast.parse(p.read_text())
fn = next((n for n in tree.body if isinstance(n, ast.FunctionDef) and n.name == "config_file_lock"), None)
if fn:
yields = [n for n in ast.walk(fn) if isinstance(n, ast.Yield)]
print("yield_count:", len(yields))
violations = []
for t in [n for n in ast.walk(fn) if isinstance(n, ast.Try)]:
has_yield = any(isinstance(x, ast.Yield) for x in ast.walk(t))
catches_oserror = any(
(h.type is not None and "OSError" in ast.unparse(h.type))
for h in t.handlers
)
if has_yield and catches_oserror:
violations.append((t.lineno, [ast.unparse(h.type) if h.type else "bare" for h in t.handlers]))
print("try_yield_catches_oserror violations:", violations)
else:
print("config_file_lock function not found")
PYRepository: NVIDIA/TensorRT-LLM
Length of output: 230
🏁 Script executed:
# Verify the exact line numbers match what was flagged
cat -n tensorrt_llm/_torch/model_config.py | sed -n '40,75p'Repository: NVIDIA/TensorRT-LLM
Length of output: 1677
🏁 Script executed:
# Double-check the contextmanager behavior with Python semantics
python3 - <<'PY'
import contextlib
# Demonstrate the problematic pattern
`@contextlib.contextmanager`
def bad_context():
print("Setup")
try:
yield
except ValueError:
print("Caught ValueError, yielding again")
yield # This is problematic!
# Test what happens
try:
with bad_context():
raise ValueError("from caller")
except Exception as e:
print(f"Final exception: {type(e).__name__}: {e}")
PYRepository: NVIDIA/TensorRT-LLM
Length of output: 169
Refactor to single-yield contextmanager; separate lock acquisition from yield scope.
This contextmanager has multiple yield statements (4 total) with try-except handlers catching OSError across yield points. When caller code inside with config_file_lock(): raises an OSError, it gets caught and re-yielded, which triggers RuntimeError: generator didn't stop after throw() and masks the original exception.
Move lock acquisition (and its error handling) before the yield. The yield body should have no try-except around it. This follows coding guidelines: "limit the except to the smallest set of errors possible" and "keep the try body small."
Refactor to single-yield contextmanager (lock acquire/release separated from yield scope)
+import errno
import contextlib
@@
`@contextlib.contextmanager`
def config_file_lock(timeout: int = 10):
@@
- # Create and acquire the lock
- lock = filelock.FileLock(str(lock_path), timeout=timeout)
-
- try:
- with lock:
- yield
- except (PermissionError, OSError, filelock.Timeout):
- # Fallback to tempdir when primary lock path is unusable (e.g.,
- # NFS locking failures like ENOLCK/ESTALE, permission issues,
- # or lock acquisition timeouts)
- tmp_dir = Path(tempfile.gettempdir())
- tmp_dir.mkdir(parents=True, exist_ok=True)
- tmp_lock_path = tmp_dir / "_remote_code.lock"
- tmp_lock = filelock.FileLock(str(tmp_lock_path), timeout=timeout)
- try:
- with tmp_lock:
- yield
- except filelock.Timeout:
- logger.warning(
- f"failed to acquire tempdir config lock within {timeout} seconds, proceeding without lock"
- )
- # proceed without lock
- yield
- except (PermissionError, OSError) as e:
- logger.warning(
- f"tempdir config lock unavailable due to OS/permission issue: {e}, proceeding without lock"
- )
- # proceed without lock
- yield
+ def _is_nfs_lock_error(exc: OSError) -> bool:
+ return exc.errno in (errno.ENOLCK, errno.ESTALE)
+
+ def _try_acquire(lock: filelock.FileLock) -> bool:
+ try:
+ lock.acquire(timeout=timeout)
+ return True
+ except (PermissionError, filelock.Timeout):
+ return False
+ except OSError as exc:
+ if _is_nfs_lock_error(exc):
+ return False
+ raise
+
+ lock = filelock.FileLock(str(lock_path), timeout=timeout)
+ active_lock: Optional[filelock.FileLock] = None
+
+ if _try_acquire(lock):
+ active_lock = lock
+ else:
+ logger.warning("Primary config lock unavailable; falling back to tempdir lock")
+ tmp_dir = Path(tempfile.gettempdir())
+ tmp_dir.mkdir(parents=True, exist_ok=True)
+ tmp_lock = filelock.FileLock(str(tmp_dir / "_remote_code.lock"), timeout=timeout)
+ if _try_acquire(tmp_lock):
+ active_lock = tmp_lock
+ else:
+ logger.warning(
+ f"Failed to acquire tempdir config lock within {timeout} seconds, proceeding without lock"
+ )
+
+ try:
+ yield
+ finally:
+ if active_lock is not None:
+ try:
+ active_lock.release()
+ except (PermissionError, filelock.Timeout, OSError) as exc:
+ logger.warning(f"Failed to release config lock cleanly: {exc}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/model_config.py` around lines 51 - 54, The contextmanager
config_file_lock currently contains multiple yield points and broad try/excepts
that can catch exceptions raised by the with-block; refactor it to a
single-yield pattern by performing lock acquisition and any related filesystem
error handling before yielding, then yield exactly once to the caller, and
perform lock release/cleanup in a finally block after the yield; ensure only the
lock acquisition logic (e.g., filelock.FileLock.acquire and fallback to tempfile
logic) is wrapped in try/except for PermissionError/OSError/filelock.Timeout,
and remove any try/except that spans the yield so exceptions from the with-block
bubble up normally.
There was a problem hiding this comment.
We might need to clean this up in the follow-up PR.
There was a problem hiding this comment.
@chienchunhung, understood — the refactor to a single-yield contextmanager (separating lock acquisition from the yield scope) can be done in a follow-up PR. The current change still improves NFS compatibility, so it's a good incremental step.
Would you like me to open a GitHub issue to track the cleanup of config_file_lock (refactor to single-yield pattern with lock acquire/release separated from the yield scope)?
🧠 Learnings used
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 11508
File: tests/unittest/_torch/sampler/test_beam_search_util.py:71-71
Timestamp: 2026-02-13T10:15:37.120Z
Learning: In TensorRT-LLM (Python requires >=3.10 and <4 as per setup.py), you can use Python 3.10+ features (e.g., PEP 585 generics like dict[str, int], list[str], etc.) throughout the codebase, and you do not need to add from __future__ import annotations. This applies to all Python files, including tests (e.g., tests/unittest/...); ensure tests and code consistently rely on Python 3.10+ features where applicable.
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 11983
File: tensorrt_llm/_torch/pyexecutor/llm_request.py:237-240
Timestamp: 2026-03-06T11:45:02.068Z
Learning: In tensorrt_llm/_torch/pyexecutor/llm_request.py, LogProbStorage.append() relies on next(iter(prob.values())).logprob to accumulate cum_log_probs when cum_log_probs is None. This path assumes prob is a non-empty dict because TorchSampler places the sampled logprob first when num_logprobs >= 0. Therefore, no guard for empty dicts is needed here. If future changes may yield empty prob, consider adding a guard or a clearer invariant.
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 12009
File: tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py:296-299
Timestamp: 2026-03-09T12:34:56.631Z
Learning: In tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py, BasicAVTransformerBlock should not be flagged for a config mismatch when config.parallel.dit_ulysses_size > 1. The function setup_sequence_parallelism() returns use_ulysses=True for dit_ulysses_size > 1, or raises a RuntimeError/ValueError/NotImplementedError; it never returns use_ulysses=False in that case. Treat this as intentional and correct; do not flag as a mismatch between raw config checks and setup_sequence_parallelism()'s result.
|
/bot run --disable-fail-fast |
|
PR_Github #38459 [ run ] triggered by Bot. Commit: |
|
PR_Github #38459 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #38492 [ run ] triggered by Bot. Commit: |
|
PR_Github #38492 [ run ] completed with state
|
chienchunhung
left a comment
There was a problem hiding this comment.
The PR looks good to me. PS: We might need to clean up something in the follow-up PR.
|
/bot run --disable-fail-fast |
|
PR_Github #38606 [ run ] triggered by Bot. Commit: |
|
PR_Github #38606 [ run ] completed with state
|
On NFS-backed HF_MODULES_CACHE paths, filelock operations can raise OSError with errno ENOLCK (No locks available) or ESTALE (Stale file handle) instead of PermissionError. This causes config_file_lock() to crash rather than falling back to the tempdir-based lock. Add OSError to the exception handlers so the existing fallback logic handles NFS locking failures gracefully. Fixes NVIDIA#11958
72a8665 to
e548b35
Compare
Summary
config_file_lock()intensorrt_llm/_torch/model_config.pycrashes whenHF_MODULES_CACHEresides on an NFS-mounted filesystem. On NFS,filelockoperations can raiseOSErrorwith errnoENOLCK(No locks available) orESTALE(Stale file handle) instead ofPermissionError. Since the current exception handler only catchesPermissionErrorandfilelock.Timeout, these NFS-specific errors bypass the existing tempdir fallback and crash the process.This is particularly impactful in multi-node GPU clusters where a shared NFS cache is standard practice — every pod that tries to load a model config concurrently hits this crash.
Changes
Add
OSErrorto both exception handlers inconfig_file_lock():except (PermissionError, filelock.Timeout)→except (PermissionError, OSError, filelock.Timeout)— triggers the tempdir fallback for NFS errorsexcept (PermissionError)→except (PermissionError, OSError)— handles the unlikely case where tempdir also failsSince
PermissionErroris a subclass ofOSError, catchingOSErrortechnically covers both, but keepingPermissionErrorexplicit preserves the original intent and readability.Root Cause
NFSv3 uses the Network Lock Manager (NLM) protocol for file locking, which is unreliable for cross-node
flock()/fcntl()operations. Whenfilelock.FileLockattempts to acquire a lock on an NFS path:OSError: [Errno 37] No locks available(ENOLCK)OSError: [Errno 116] Stale file handle(ESTALE)The existing fallback to
/tmp(local ephemeral storage) is the correct behavior for this case — it just wasn't being triggered.Reproduction
Workaround (for users on affected versions)
Set
HF_MODULES_CACHE=/tmp/hf_modulesas an environment variable to redirect the lock file to local storage.Fixes #11958
Made with Cursor
Summary by CodeRabbit