Retry on ENOLCK from NFS lockd in fcntl-based locks (#4846) by annagiroti · Pull Request #5508 · DataBiosphere/toil

annagiroti · 2026-04-30T21:31:06Z

When Toil runs on NFS filesystems, fcntl.flock can raise OSError [Errno 37] No locks available (ENOLCK) if the NFS lock daemon (lockd) is temporarily unavailable. Previously, this caused jobs to crash immediately. This PR extends the existing retry logic in safe_lock (which already handled EIO for Ceph) to also retry on ENOLCK with exponential backoff. safe_unlock_and_close is also updated to swallow ENOLCK the same way it does EIO. Unit tests are added for both error cases using mocked fcntl.flock.

Resolves #4846
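
For readers who want the shape of the change without opening the diff, here is a minimal sketch of retrying a lock on EIO or ENOLCK with exponential backoff. It is not Toil's safe_lock: the function name lock_with_retry, the max_attempts and base_delay defaults, and the injectable flock and sleep parameters are all assumptions made for illustration.

```python
import errno
import fcntl
import time
from typing import Callable

# Errors treated as transient in this sketch: EIO (seen with Ceph) and
# ENOLCK (seen when the NFS lock daemon is briefly unavailable).
RETRYABLE_ERRNOS = (errno.EIO, errno.ENOLCK)


def lock_with_retry(
    fd: int,
    max_attempts: int = 5,      # hypothetical default, not Toil's value
    base_delay: float = 0.1,    # hypothetical default, not Toil's value
    flock: Callable[[int, int], None] = fcntl.flock,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Take an exclusive lock on fd, retrying transient errors with backoff."""
    for attempt in range(max_attempts):
        try:
            flock(fd, fcntl.LOCK_EX)
            return
        except OSError as e:
            if e.errno not in RETRYABLE_ERRNOS:
                # Anything else is a real failure; do not mask it.
                raise
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last transient error.
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * (2 ** attempt))
```

Passing flock and sleep as parameters is only a convenience for exercising the sketch without a real NFS mount; the tests in this PR patch fcntl.flock and time.sleep instead.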

Changelog Entry

To be copied to the draft changelog by merger:

Retry on ENOLCK from NFS lockd in fcntl-based locks to handle flaky NFS mounts (Handle OSError: [Errno 37] No locks available in file locking #4846)

Reviewer Checklist

- Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
  - If it is coming from an external repo, make sure to pull it in for CI with:
    ```
    contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
    ```
  - If there is no associated issue, create one.
- Read through the code changes. Make sure that it doesn't have:
  - Addition of trailing whitespace.
  - New variable or member names in camelCase that want to be in snake_case.
  - New functions without type hints.
  - New functions or classes without informative docstrings.
  - Changes to semantics not reflected in the relevant docstrings.
  - New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
  - New features without tests.
- Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
- Finish the review with an overall description of your opinion.

Merger Checklist

- Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
- Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
- Merge with the Github "Squash and merge" feature.
  - If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
- Copy its recommended changelog entry to the Draft Changelog.
- Append the issue number in parentheses to the changelog entry.

adamnovak

This looks pretty good, but I think the test code should be deduplicated a bit.

adamnovak · 2026-05-04T20:20:53Z

+                        )
+                    else:
+                        logger.critical(
+                            "Too many IO errors talking to lock file. If using Ceph, check for MDS deadlocks. See <https://tracker.ceph.com/issues/62123>."


We might wrap this message like the new one.

adamnovak · 2026-05-04T20:21:51Z

-        if e.errno != errno.EIO:
+        if e.errno not in (errno.EIO, errno.ENOLCK):
            raise
        # Sometimes Ceph produces EIO. We don't need to retry then because


This comment now only mentions one of the two cases its branch needs to implement. We probably could drop that and just talk about how we don't need to retry.
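
To make the branch under discussion concrete, here is a minimal sketch of an unlock-and-close helper that swallows both errno values. It is not Toil's safe_unlock_and_close: the name unlock_and_close and the injectable flock and close parameters are hypothetical, and the stated reason for skipping the retry (closing the descriptor releases the lock anyway) is my assumption about the rationale, since the original comment is cut off in the diff.

```python
import errno
import fcntl
import os
from typing import Callable


def unlock_and_close(
    fd: int,
    flock: Callable[[int, int], None] = fcntl.flock,
    close: Callable[[int], None] = os.close,
) -> None:
    """Release the lock on fd and close it, tolerating EIO and ENOLCK."""
    try:
        flock(fd, fcntl.LOCK_UN)
    except OSError as e:
        if e.errno not in (errno.EIO, errno.ENOLCK):
            raise
        # Assumed rationale: no retry is needed here, because closing
        # the descriptor below gives up the lock regardless.
    close(fd)
```

A comment worded around "we don't need to retry" covers both errno values at once, which is what the review asks for.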

adamnovak · 2026-05-04T20:22:24Z

                    "precious"
                ), f"File {filename} still exists"
+
+    # Tests for ENOLCK (toil#4846)


Do we need this? Is it better as a docstring?

adamnovak · 2026-05-04T20:22:52Z

                ), f"File {filename} still exists"
+
+    # Tests for ENOLCK (toil#4846)
+    def testSafeLockRetriesOnENOLCK(self) -> None:


These new function names ought to be snake_case.

adamnovak · 2026-05-04T20:28:32Z

+    def testSafeLockRetriesOnENOLCK(self) -> None:
+        enolck = OSError(errno.ENOLCK, "No locks available")
+        # First call raises ENOLCK, second call succeeds
+        with patch("fcntl.flock", side_effect=[enolck, None]) as mock_flock:
+            safe_lock(0)
+            assert mock_flock.call_count == 2
+
+    def testSafeLockFailsAfterMaxRetriesOnENOLCK(self) -> None:
+        enolck = OSError(errno.ENOLCK, "No locks available")
+        # First call raises ENOLCK, second call succeeds
+        with patch("fcntl.flock", side_effect=enolck):
+            with patch("toil.lib.threading.time.sleep"):  # skip the backoff waits
+                try:
+                    safe_lock(0)
+                    assert False, "Expected OSError to be raised"
+                except OSError as e:
+                    assert e.errno == errno.ENOLCK
+
+    def testSafeLockRetriesOnEIO(self) -> None:
+        eio = OSError(errno.EIO, "Input/Output Error")
+        # First call raises EIO, second call succeeds
+        with patch("fcntl.flock", side_effect=[eio, None]) as mock_flock:
+            safe_lock(0)
+            assert mock_flock.call_count == 2
+
+    def testSafeLockFailsAfterMaxRetriesOnEIO(self) -> None:
+        eio = OSError(errno.EIO, "Input/Output Error")
+        # First call raises EIO, second call succeeds
+        with patch("fcntl.flock", side_effect=eio):
+            with patch("toil.lib.threading.time.sleep"):  # skip the backoff waits
+                try:
+                    safe_lock(0)
+                    assert False, "Expected OSError to be raised"
+                except OSError as e:
+                    assert e.errno == errno.EIO
+
+    def testSafeUnlockAndCloseSwallowsENOLCK(self) -> None:
+        enolck = OSError(errno.ENOLCK, "No locks available")
+        # First call raises ENOLCK, second call succeeds
+        with patch("fcntl.flock", side_effect=enolck):
+            with patch("os.close") as mock_close:
+                safe_unlock_and_close(0)
+                mock_close.assert_called_once_with(0)
+
+    def testSafeUnlockAndCloseSwallowsEIO(self) -> None:
+        # First call raises EIO, second call succeeds
+        eio = OSError(errno.EIO, "Input/output error")
+        with patch("fcntl.flock", side_effect=eio):
+            with patch("os.close") as mock_close:
+                safe_unlock_and_close(0)
+                mock_close.assert_called_once_with(0)


There are two substantially identical sets of 3 tests here, which differ just on the errno value and message used. We should consolidate them, either using inheritance (one base class with an abstract raise_error that we implement in IOError and NoLocksError subclasses), or using pytest subtests and a loop over the errno-and-message pairs (or over a constant list of pre-constructed exceptions).
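
As an illustration of the loop-based option, here is a minimal sketch that runs the same assertions over a constant list of pre-constructed exceptions. It uses the standard library's unittest subTest rather than pytest subtests, and it tests a hypothetical stand-in function, lock_stand_in, so that the block is self-contained; neither the stand-in nor the class name comes from the PR.

```python
import errno
import fcntl
import unittest
from unittest.mock import patch

# Pre-constructed exceptions to loop over, one per errno under test.
LOCK_ERRORS = [
    OSError(errno.EIO, "Input/output error"),
    OSError(errno.ENOLCK, "No locks available"),
]


def lock_stand_in(fd: int, attempts: int = 3) -> None:
    """Hypothetical stand-in for the function under test (not Toil's safe_lock)."""
    for attempt in range(attempts):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            return
        except OSError as e:
            last_try = attempt == attempts - 1
            if e.errno not in (errno.EIO, errno.ENOLCK) or last_try:
                raise


class LockErrorTest(unittest.TestCase):
    def test_retries_then_succeeds(self) -> None:
        for error in LOCK_ERRORS:
            with self.subTest(errno=error.errno):
                # First call raises, second call succeeds.
                with patch("fcntl.flock", side_effect=[error, None]) as mock_flock:
                    lock_stand_in(0)
                self.assertEqual(mock_flock.call_count, 2)

    def test_fails_after_max_retries(self) -> None:
        for error in LOCK_ERRORS:
            with self.subTest(errno=error.errno):
                # Every call raises, so the error should propagate.
                with patch("fcntl.flock", side_effect=error):
                    with self.assertRaises(OSError) as caught:
                        lock_stand_in(0)
                self.assertEqual(caught.exception.errno, error.errno)
```

Adding a third errno then means adding one entry to LOCK_ERRORS rather than three more test methods.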


adamnovak · 2026-05-06T21:20:23Z

I like the code now, but if you click on the little red X and then "Details" on the failing Gitlab job, and open up the failing lint step and peruse the log, you can see that the type checking is failing because of this:

src/toil/test/src/threadingTest.py: note: In member "test_safe_lock_fails_after_max_retries" of class "BaseSafeLockingTest":
src/toil/test/src/threadingTest.py:102:39: error: "Exception" has no attribute
"errno"  [attr-defined]
                        assert e.errno == error.errno
                                          ^~~~~~~~~~~
Found 1 error in 1 file (checked 131 source files)

I think the problem is that get_exception() is typed to return Exception, but for us to check things based on the result's errno we need to type it as returning OSError instead.
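
The typing point can be shown in isolation. The following is a minimal sketch, with hypothetical class and method names rather than the ones in threadingTest.py, of a base class whose error factory is annotated as returning OSError so that reading errno type-checks.

```python
import errno
from abc import ABC, abstractmethod


class BaseErrorCase(ABC):
    """Hypothetical base for cases that share logic but differ by error."""

    @abstractmethod
    def get_error(self) -> OSError:
        """Return the error this case exercises.

        Annotating the return as OSError (not Exception) is what lets a
        type checker accept reading .errno on the result.
        """

    def expected_errno(self) -> int:
        error = self.get_error()
        # If get_error were annotated "-> Exception", mypy would report
        # attr-defined on the next line, as in the log above.
        assert error.errno is not None
        return error.errno


class EIOCase(BaseErrorCase):
    def get_error(self) -> OSError:
        return OSError(errno.EIO, "Input/output error")


class ENOLCKCase(BaseErrorCase):
    def get_error(self) -> OSError:
        return OSError(errno.ENOLCK, "No locks available")
```

Nothing changes at runtime between the two annotations; only the static check is affected.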

adamnovak · 2026-05-06T21:21:19Z

@annagiroti See if you can fix this up so it passes all the CI tests. You should be able to run make mypy to do the type checking locally.

Retry on ENOLCK from NFS lockd in fcntl-based locks (#4846)

dd836e8

annagiroti requested a review from adamnovak April 30, 2026 21:31

adamnovak requested changes May 4, 2026


Address review feedback: deduplicate tests, fix comments, snake_case (#4846)

d394a29

adamnovak approved these changes May 6, 2026


Fix get_error() return type annotation to OSError for mypy (#4846)

ce65568

adamnovak merged commit 5904df0 into master May 7, 2026
3 checks passed

adamnovak deleted the issues/4846-retry-enolck-nfs branch May 7, 2026 15:23

adamnovak merged 3 commits into master from issues/4846-retry-enolck-nfs
