Retry on ENOLCK from NFS lockd in fcntl-based locks (#4846)#5508
Conversation
adamnovak
left a comment
There was a problem hiding this comment.
This looks pretty good, but the test code I think should be deduplicated a bit.
| ) | ||
| else: | ||
| logger.critical( | ||
| "Too many IO errors talking to lock file. If using Ceph, check for MDS deadlocks. See <https://tracker.ceph.com/issues/62123>." |
There was a problem hiding this comment.
We might wrap this message like the new one.
| if e.errno != errno.EIO: | ||
| if e.errno not in (errno.EIO, errno.ENOLCK): | ||
| raise | ||
| # Sometimes Ceph produces EIO. We don't need to retry then because |
There was a problem hiding this comment.
This comment now only mentions one of the two cases its branch needs to implement. We probably could drop that and just talk about how we don't need to retry.
| "precious" | ||
| ), f"File {filename} still exists" | ||
|
|
||
| # Tests for ENOLCK (toil#4846) |
There was a problem hiding this comment.
Do we need this? Is it better as a docstring?
| ), f"File {filename} still exists" | ||
|
|
||
| # Tests for ENOLCK (toil#4846) | ||
| def testSafeLockRetriesOnENOLCK(self) -> None: |
There was a problem hiding this comment.
These new function names ought to be snake_case.
| def testSafeLockRetriesOnENOLCK(self) -> None: | ||
| enolck = OSError(errno.ENOLCK, "No locks available") | ||
| # First call raises ENOLCK, second call succeeds | ||
| with patch("fcntl.flock", side_effect=[enolck, None]) as mock_flock: | ||
| safe_lock(0) | ||
| assert mock_flock.call_count == 2 | ||
|
|
||
| def testSafeLockFailsAfterMaxRetriesOnENOLCK(self) -> None: | ||
| enolck = OSError(errno.ENOLCK, "No locks available") | ||
| # First call raises ENOLCK, second call succeeds | ||
| with patch("fcntl.flock", side_effect=enolck): | ||
| with patch("toil.lib.threading.time.sleep"): # skip the backoff waits | ||
| try: | ||
| safe_lock(0) | ||
| assert False, "Expected OSError to be raised" | ||
| except OSError as e: | ||
| assert e.errno == errno.ENOLCK | ||
|
|
||
| def testSafeLockRetriesOnEIO(self) -> None: | ||
| eio = OSError(errno.EIO, "Input/Output Error") | ||
| # First call raises EIO, second call succeeds | ||
| with patch("fcntl.flock", side_effect=[eio, None]) as mock_flock: | ||
| safe_lock(0) | ||
| assert mock_flock.call_count == 2 | ||
|
|
||
| def testSafeLockFailsAfterMaxRetriesOnEIO(self) -> None: | ||
| eio = OSError(errno.EIO, "Input/Output Error") | ||
| # First call raises EIO, second call succeeds | ||
| with patch("fcntl.flock", side_effect=eio): | ||
| with patch("toil.lib.threading.time.sleep"): # skip the backoff waits | ||
| try: | ||
| safe_lock(0) | ||
| assert False, "Expected OSError to be raised" | ||
| except OSError as e: | ||
| assert e.errno == errno.EIO | ||
|
|
||
| def testSafeUnlockAndCloseSwallowsENOLCK(self) -> None: | ||
| enolck = OSError(errno.ENOLCK, "No locks available") | ||
| # First call raises ENOLCK, second call succeeds | ||
| with patch("fcntl.flock", side_effect=enolck): | ||
| with patch("os.close") as mock_close: | ||
| safe_unlock_and_close(0) | ||
| mock_close.assert_called_once_with(0) | ||
|
|
||
| def testSafeUnlockAndCloseSwallowsEIO(self) -> None: | ||
| # First call raises EIO, second call succeeds | ||
| eio = OSError(errno.EIO, "Input/output error") | ||
| with patch("fcntl.flock", side_effect=eio): | ||
| with patch("os.close") as mock_close: | ||
| safe_unlock_and_close(0) | ||
| mock_close.assert_called_once_with(0) |
There was a problem hiding this comment.
There are two substantially identical sets of 3 tests here, which differ just on the errno value and message used. We should consolidate them, either using inheritance (one base class with an abstract raise_error that we implement in IOError and NoLocksError subclasses), or using pytest subtests and a loop over the errno-and-message pairs (or over a constant list of pre-constructed exceptions).
|
I like the code now, but if you click on the little red X and then "Details" on the failing Gitlab job, and open up the failing I think the problem is that |
|
@annagiroti See if you can fix this up so it passes all the CI tests. You should be able to |
When Toil runs on NFS filesystems,
fcntl.flockcan raiseOSError [Errno 37] No locks available(ENOLCK) if the NFS lock daemon (lockd) is temporarily unavailable. Previously, this caused jobs to crash immediately. This PR extends the existing retry logic insafe_lock(which already handledEIOfor Ceph) to also retry onENOLCKwith exponential backoff.safe_unlock_and_closeis also updated to swallowENOLCKthe same way it doesEIO. Unit tests are added for both error cases using mockedfcntl.flock.Resolves #4846
Changelog Entry
To be copied to the draft changelog by merger:
OSError: [Errno 37] No locks availablein file locking #4846)Reviewer Checklist
issues/XXXX-fix-the-thingin the Toil repo, or from an external repo.camelCasethat want to be insnake_case.docs/running/{cliOptions,cwl,wdl}.rstMerger Checklist