Fix stale locks (#56) and add __repr__/__str__ to public classes (#21)#63
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdded stale-lock detection and recovery during multi-file retrieval; changed upload-failure message composition; introduced repr/str for FileCache, FileCacheSource, and FCPath; added POSIX-only stale-lock and repr tests; removed an extra blank line in Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #63 +/- ##
==========================================
+ Coverage 89.55% 89.73% +0.17%
==========================================
Files 5 5
Lines 2413 2465 +52
Branches 507 518 +11
==========================================
+ Hits 2161 2212 +51
- Misses 149 150 +1
Partials 103 103 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@filecache/file_cache_source.py`:
- Around line 1296-1301: The blanket except around blob.upload_from_filename
hides a FileNotFoundError and incorrectly converts it to UploadFailed; update
the handler in the upload call(s) (the blob.upload_from_filename call sites) so
that if the caught exception is a FileNotFoundError it is re-raised unchanged,
otherwise wrap/raise an UploadFailed with the existing message that includes
self._src_prefix_ and sub_path; apply the same change to the other identical
upload block at the later location mentioned (around the 1627-1636 block).
In `@filecache/file_cache.py`:
- Around line 1973-1981: The docstring for the public FileCache.upload() is out
of sync: the implementation can raise UploadFailed in addition to
FileNotFoundError; update the FileCache.upload() docstring to document that
FileNotFoundError is raised when local files are missing and UploadFailed is
raised when remote upload(s) fail (include that both can be raised together in
the error message logic that combines files_not_exist and files_failed), so
callers know to catch UploadFailed as well as FileNotFoundError.
In `@tests/test_file_cache.py`:
- Around line 1155-1159: Replace the brittle time.sleep(0.3) with a state-based
barrier: instead of sleeping, wait for an observable condition that proves
retrieve() has reached the stale-lock wait path (e.g. a threading.Event or flag
set when the code enters wait_to_appear, or by polling until retrieve_thread is
alive AND the other files have finished downloading), then call
flock_release.set(); specifically modify the test around
retrieve_thread/retrieve() and flock_release so the test waits on that
event/condition (or polls for “all other files finished while retrieve_thread is
still alive”) before releasing the flock to reliably exercise the stale-lock
branch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 19a6fbc5-2a9f-495a-9208-fc19978392c4
📒 Files selected for processing (5)
filecache/__init__.pyfilecache/exceptions.pyfilecache/file_cache.pyfilecache/file_cache_source.pytests/test_file_cache.py
Reverts the #12 upload-exception changes (removes UploadFailed / FileCacheError, restores native exceptions). Implements #21: adds __repr__ and __str__ to FileCache, enhances FCPath.__repr__ to include non-default options (filecache, anonymous, lock_timeout, nthreads), and adds __repr__/__str__ to FileCacheSource and all subclasses via the base class. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
tests/test_file_cache.py (1)
1155-1159:⚠️ Potential issue | 🟠 MajorReplace the fixed sleep with a state-based barrier.
time.sleep(0.3)still doesn't prove the retrieve thread has entered the stale-lock wait path, so these tests can miss the regression or flake on slower/faster CI. Wait for an observable state before callingflock_release.set(), such as “all other files finished downloading whileretrieve_threadis still alive”.Suggested direction
- # Give the retrieve thread time to enter the wait_to_appear loop. - time.sleep(0.3) + deadline = time.monotonic() + 5 + while time.monotonic() < deadline: + if (fc.download_counter == len(EXPECTED_FILENAMES) - 1 and + retrieve_thread.is_alive()): + break + time.sleep(0.01) + else: + pytest.fail('retrieve never reached the blocked/stale-lock state') # Signal the holder to "crash" -- releases flock, keeps the lock file. flock_release.set()Also applies to: 1216-1218
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_file_cache.py` around lines 1155 - 1159, Replace the brittle time.sleep(0.3) with a state-based barrier: wait (with timeout) until the retrieve_thread is confirmed to be in the stale-lock wait path and the other downloads have completed before calling flock_release.set(). For example, poll or wait on a test Event/Condition (set by the code under test or by observing a shared counter) that indicates "retrieve_thread is waiting" and that "other files finished downloading" (use retrieve_thread.is_alive() plus a downloaded-files counter or a retrieve_waiting_event) and only then call flock_release.set(); keep a short timeout to fail fast if the state never appears.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@filecache/file_cache_path.py`:
- Around line 541-551: FCPath.__repr__ currently omits the translator hooks, so
objects with different url_to_url or url_to_path print identically; update
FCPath.__repr__ to include these attributes when present by appending entries
like f'url_to_url={self._url_to_url!r}' and f'url_to_path={self._url_to_path!r}'
to the parts list (guarded by checks for None just like the other fields) so the
returned string reflects the translator hooks' presence and values.
In `@filecache/file_cache_source.py`:
- Around line 76-78: The __repr__ implementation in FileCacheSource uses
self._src_prefix but the class (and its subclasses) are constructed with
(scheme, remote, *, anonymous=...), so update FileCacheSource.__repr__ to emit a
constructor-like representation using self._scheme and self._remote (and keep
the anonymous flag from self._anonymous) instead of self._src_prefix; ensure the
string format matches other constructors by including scheme and remote as the
first two quoted args and anonymous=<bool> as the kwarg.
In `@filecache/file_cache.py`:
- Around line 421-425: The __repr__ currently prints self._cache_dir.name (which
includes the internal "_filecache_" prefix) causing misleading reconstructions;
update FileCache.__repr__ to display the user-provided cache name attribute
(e.g., self._cache_name) instead of self._cache_dir.name and explicitly show
None when no name was provided so repr round-trips correctly, or if
self._cache_name doesn't exist use the original user-facing name field rather
than the generated directory name.
- Around line 1981-1989: The code currently raises FileNotFoundError for both
missing local files and provider upload failures, losing distinction; update the
branch in file_cache.py so that missing local files still raise
FileNotFoundError (when files_not_exist and not files_failed) but when
files_failed is non-empty raise a distinct exception (e.g., UploadFailedError or
ProviderUploadError) with the exc_str and any underlying provider exception
info; if such a custom exception class doesn’t exist, add one (e.g., class
UploadFailedError(Exception)) and use it in the block that builds exc_str (and
include files_not_exist text only as part of the UploadFailedError message when
both exist).
---
Duplicate comments:
In `@tests/test_file_cache.py`:
- Around line 1155-1159: Replace the brittle time.sleep(0.3) with a state-based
barrier: wait (with timeout) until the retrieve_thread is confirmed to be in the
stale-lock wait path and the other downloads have completed before calling
flock_release.set(). For example, poll or wait on a test Event/Condition (set by
the code under test or by observing a shared counter) that indicates
"retrieve_thread is waiting" and that "other files finished downloading" (use
retrieve_thread.is_alive() plus a downloaded-files counter or a
retrieve_waiting_event) and only then call flock_release.set(); keep a short
timeout to fail fast if the state never appears.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9425fc62-431c-49cf-81f0-c6d06ff27e3e
📒 Files selected for processing (5)
filecache/__init__.pyfilecache/file_cache.pyfilecache/file_cache_path.pyfilecache/file_cache_source.pytests/test_file_cache.py
💤 Files with no reviewable changes (1)
- filecache/init.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- FileCache.__repr__: store _cache_name and show it directly so
repr(FileCache(None)) prints None and repr(FileCache('global'))
prints 'global', not the internal _filecache_-prefixed dir name
- FileCacheSource.__repr__: use (scheme, remote) to match the actual
constructor signature instead of the derived src_prefix URI
- FCPath.__repr__: include url_to_url and url_to_path when non-None
so two behaviourally different instances don't print identically
- test_stale_lock_multi/pfx: replace brittle time.sleep(0.3) with a
state-based barrier that polls until download_counter reaches
len(EXPECTED_FILENAMES)-1 and the retrieve thread is still alive,
guaranteeing the thread is in wait_to_appear before the flock is released
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed the CodeRabbit review comments — details below. Implemented:
Not implemented (no longer relevant): The remaining comments (about |
Stale-lock items are now grouped by source and downloaded in parallel using source.retrieve_multi(nthreads=nthreads) instead of sequentially calling source.retrieve() in a loop. Locks are released after the parallel batch completes, matching the pattern used by the initial multi-file download path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@filecache/file_cache.py`:
- Around line 1659-1674: The wait loop incorrectly treats negative lock_timeout
as a timeout because it unconditionally compares elapsed > lock_timeout; update
the loop logic (the section setting timed_out and the condition that checks
elapsed > lock_timeout) to only perform the elapsed > lock_timeout check when
lock_timeout is non-negative (e.g., lock_timeout >= 0), so negative values mean
“never time out”; apply the same change to the analogous check around lines
1743-1745 to ensure both wait loops honor negative lock_timeout.
In `@tests/test_file_cache.py`:
- Around line 1137-1139: The call to flock_held.wait() in the test's setup can
hang if the holder_thread dies before setting the event; change the unbounded
wait to use a timeout (e.g. flock_held.wait(timeout=...) matching the later
thread.join() timeouts) so the test fails fast instead of blocking indefinitely;
apply this same change for the second occurrence in test_stale_lock_multi_pfx()
and keep the timeout value consistent with the existing join() timeouts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c24af9db-44c9-4968-88e1-41c05860fb75
📒 Files selected for processing (4)
filecache/file_cache.pyfilecache/file_cache_path.pyfilecache/file_cache_source.pytests/test_file_cache.py
| # In each iteration we also check for stale locks: a stale lock exists when the | ||
| # lock file is present but the process that created it has died (the OS released | ||
| # the advisory flock, but did not delete the file). We detect a stale lock by | ||
| # attempting a non-blocking acquire -- if it succeeds, no live process holds the | ||
| # lock, so we steal it and initiate the download ourselves. | ||
| # | ||
| # Race-condition guarantee: multiple waiting processes may all notice the same | ||
| # stale lock and all attempt `lock.acquire(timeout=0)` simultaneously. Because | ||
| # the underlying flock(2) call is atomic, exactly one process wins the race; the | ||
| # others receive a Timeout and remain in the wait list. The winner downloads the | ||
| # file atomically (write to temp path + rename), so the losers will find the | ||
| # completed file on their next poll iteration. | ||
| # | ||
| # If the lock file disappears without the destination file appearing, that means | ||
| # the other process failed (or cleaned up after itself on error). | ||
| timed_out = False |
There was a problem hiding this comment.
Honor negative lock_timeout in the wait loop.
The public contract says a negative timeout means “never time out,” but this loop still checks elapsed > lock_timeout. For any negative value, that becomes true on the first poll, so locked multi-file retrieves can raise TimeoutError immediately instead of waiting indefinitely.
Proposed fix
- if time.time() - start_time > lock_timeout:
+ if lock_timeout >= 0 and time.time() - start_time > lock_timeout:
exc = TimeoutError(
'Timeout while waiting for another process to finish downloading')Also applies to: 1743-1745
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@filecache/file_cache.py` around lines 1659 - 1674, The wait loop incorrectly
treats negative lock_timeout as a timeout because it unconditionally compares
elapsed > lock_timeout; update the loop logic (the section setting timed_out and
the condition that checks elapsed > lock_timeout) to only perform the elapsed >
lock_timeout check when lock_timeout is non-negative (e.g., lock_timeout >= 0),
so negative values mean “never time out”; apply the same change to the analogous
check around lines 1743-1745 to ensure both wait loops honor negative
lock_timeout.
- test_fcpath_repr_str → tests/test_file_cache_path.py - test_filecachesource_repr_str → tests/test_file_cache_source.py - test_filecache_repr_str stays in tests/test_file_cache.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
tests/test_file_cache.py (1)
1137-1139:⚠️ Potential issue | 🟡 MinorBound the initial
flock_held.wait()calls.If
holder_thread()fails before signaling the event, both tests hang the suite instead of failing diagnostically. Use the same kind of timeout you already apply to the laterjoin()calls.Suggested fix
- flock_held.wait() # make sure the flock is held before we start retrieve + assert flock_held.wait(timeout=5), 'holder thread never acquired flock' @@ - flock_held.wait() + assert flock_held.wait(timeout=5), 'holder thread never acquired flock'Also applies to: 1207-1209
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_file_cache.py` around lines 1137 - 1139, The test can hang if holder_thread() never signals flock_held; update the two places that call flock_held.wait() (the ones after starting the Thread that runs holder_thread) to use a bounded wait with a timeout (same pattern used for the later t.join(timeout=...) calls), and assert/raise a clear failure if the wait returns False so the test fails fast; locate the calls by looking for flock_held.wait() immediately after creating/starting the Thread targeting holder_thread.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_file_cache_path.py`:
- Around line 1313-1331: Add assertions to test_fcpath_repr_str to exercise
FCPath.__repr__ branches that include translator hooks: construct FCPath
instances with the new url_to_url and url_to_path keyword fields (e.g.,
FCPath(..., url_to_url=some_callable, url_to_path=some_callable)) and verify
their repr includes 'url_to_url=' and 'url_to_path=' (and any identifying
callable/name), and also test when both are None they do not appear; reference
FCPath and its __repr__ behavior and the url_to_url / url_to_path attributes in
file_cache_path.py to locate where to add these extra assertions.
In `@tests/test_file_cache_source.py`:
- Around line 409-435: Add a direct test for the base class FileCacheSource to
cover its __repr__ and __str__ implementations: instantiate FileCacheSource with
appropriate constructor args (matching how subclasses are constructed in the
file, e.g., scheme and bucket and anonymous flag), assert
repr(FileCacheSource(...)) matches the expected "FileCacheSource('scheme',
'bucket', anonymous=...)" string and assert str(FileCacheSource(...)) returns
the expected URI string (e.g., "scheme://bucket" or "file://" for empty bucket);
reference the FileCacheSource class and its __repr__/__str__ methods when adding
the new assertion to the existing test_filecachesource_repr_str test.
---
Duplicate comments:
In `@tests/test_file_cache.py`:
- Around line 1137-1139: The test can hang if holder_thread() never signals
flock_held; update the two places that call flock_held.wait() (the ones after
starting the Thread that runs holder_thread) to use a bounded wait with a
timeout (same pattern used for the later t.join(timeout=...) calls), and
assert/raise a clear failure if the wait returns False so the test fails fast;
locate the calls by looking for flock_held.wait() immediately after
creating/starting the Thread targeting holder_thread.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 605a976b-9f3b-42fe-ae37-37874ca9d711
📒 Files selected for processing (3)
tests/test_file_cache.pytests/test_file_cache_path.pytests/test_file_cache_source.py
| def test_filecachesource_repr_str(): | ||
| src_gs = FileCacheSourceGS('gs', 'rms-filecache-tests', anonymous=True) | ||
| r = repr(src_gs) | ||
| assert r == "FileCacheSourceGS('gs', 'rms-filecache-tests', anonymous=True)" | ||
| assert str(src_gs) == 'gs://rms-filecache-tests' | ||
|
|
||
| src_s3 = FileCacheSourceS3('s3', 'rms-filecache-tests', anonymous=True) | ||
| r = repr(src_s3) | ||
| assert r == "FileCacheSourceS3('s3', 'rms-filecache-tests', anonymous=True)" | ||
| assert str(src_s3) == 's3://rms-filecache-tests' | ||
|
|
||
| src_http = FileCacheSourceHTTP('https', 'storage.googleapis.com', | ||
| anonymous=False) | ||
| r = repr(src_http) | ||
| assert r == ("FileCacheSourceHTTP('https', 'storage.googleapis.com', " | ||
| "anonymous=False)") | ||
| assert str(src_http) == 'https://storage.googleapis.com' | ||
|
|
||
| src_file = FileCacheSourceFile('file', '', anonymous=False) | ||
| r = repr(src_file) | ||
| assert r == "FileCacheSourceFile('file', '', anonymous=False)" | ||
| assert str(src_file) == 'file://' | ||
|
|
||
| src_fake = FileCacheSourceFake('fake', 'fake-bucket', anonymous=False) | ||
| r = repr(src_fake) | ||
| assert r == "FileCacheSourceFake('fake', 'fake-bucket', anonymous=False)" | ||
| assert str(src_fake) == 'fake://fake-bucket' |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add a direct FileCacheSource repr/str assertion.
This only covers subclasses, but the PR scope also includes the public base FileCacheSource. Since filecache/file_cache_source.py:75-80 defines __repr__ / __str__ there, one direct base-class case would keep that contract covered too.
Suggested test addition
def test_filecachesource_repr_str():
+ base = FileCacheSource('custom', 'remote', anonymous=False)
+ assert repr(base) == "FileCacheSource('custom', 'remote', anonymous=False)"
+ assert str(base) == 'custom://remote'
+
src_gs = FileCacheSourceGS('gs', 'rms-filecache-tests', anonymous=True)
r = repr(src_gs)
assert r == "FileCacheSourceGS('gs', 'rms-filecache-tests', anonymous=True)"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_file_cache_source.py` around lines 409 - 435, Add a direct test
for the base class FileCacheSource to cover its __repr__ and __str__
implementations: instantiate FileCacheSource with appropriate constructor args
(matching how subclasses are constructed in the file, e.g., scheme and bucket
and anonymous flag), assert repr(FileCacheSource(...)) matches the expected
"FileCacheSource('scheme', 'bucket', anonymous=...)" string and assert
str(FileCacheSource(...)) returns the expected URI string (e.g.,
"scheme://bucket" or "file://" for empty bucket); reference the FileCacheSource
class and its __repr__/__str__ methods when adding the new assertion to the
existing test_filecachesource_repr_str test.
- Fix pre-existing bug: negative lock_timeout (meaning "never time out") was immediately triggering TimeoutError because elapsed > negative is always true. Now guarded with lock_timeout >= 0. - Bound flock_held.wait() with timeout=5 in stale lock tests so they fail fast instead of hanging if the holder thread dies. - Add url_to_url/url_to_path coverage to test_fcpath_repr_str. - Skip base FileCacheSource repr test (it's an ABC, can't instantiate). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed the second round of CodeRabbit review comments: Implemented:
Not implemented:
|
Summary
Fixes #56 (stale lock detection) and fixes #21 (
__repr__/__str__for public classes). Also fixes a pre-existing bug where negativelock_timeout(meaning "never time out") would immediately triggerTimeoutError.Issue #56 — Stale lock detection in
_retrieve_multi_lockedWhen a process holding a file-lock crashes, the POSIX
flockis automatically released by the kernel, but the lock file on disk remains. The previouswait_to_appearloop only testedlock_path.is_file()and would wait forever if the owner died without cleaning up its lock file.Fix: Each iteration of the polling loop now attempts a non-blocking
filelock.FileLock(lock_path, timeout=0).acquire(). If the acquire succeeds the lock file is stale — the current process steals ownership, groups stale items by source, and downloads them in parallel viaretrieve_multi. If acquire fails (the original owner is still alive), the path goes back ontowait_to_appearfor the next iteration.Race-condition guarantee:
fcntl.flock(LOCK_EX|LOCK_NB)is a single atomic kernel syscall. If two processes simultaneously detect a stale lock and both callacquire(timeout=0), exactly one wins and the other getsfilelock._error.Timeout— there is no window where both believe they hold the lock. The winner downloads to a temp file and renames it into place atomically; the loser finds the completed file on its next poll cycle._retrieve_single(single-file locked download) was already correct — thefilelock.FileLock(timeout=...)loop will eventually acquire the freed flock after a crash, with no code changes needed.New tests (
tests/test_file_cache.py):test_stale_lock_single— creates a stale lock via rawfcntl.flock, verifies retrieval succeeds.test_stale_lock_multi— background thread holds an flock; retrieve enterswait_to_appear; thread releases flock (simulating a crash without cleanup); verifies download completes. Uses a state-based barrier (pollsdownload_counter) instead oftime.sleep.test_stale_lock_multi_pfx— same scenario exercised through theFCPathinterface.fcntl).Bonus fix: The
wait_to_appeartimeout check now guards against negativelock_timeout(lock_timeout >= 0 and ...), which previously caused immediateTimeoutErrorinstead of waiting indefinitely as documented.Issue #21 —
__repr__and__str__for all public classesAdds or improves
__repr__and__str__on every public (non-internal) class:__repr____str__FileCachecache_name,anonymous,lock_timeout,nthreadsFCPathfilecache,anonymous,lock_timeout,nthreads,url_to_url,url_to_path)FileCacheSource(base)ClassName(scheme, remote, anonymous=...)— inherited by all five subclassesNew tests:
test_filecache_repr_str(intest_file_cache.py),test_fcpath_repr_str(intest_file_cache_path.py),test_filecachesource_repr_str(intest_file_cache_source.py).Test plan
pytest tests/ -x -q)__repr__/__str__tested for all five source subclasses,FileCache, andFCPath🤖 Generated with Claude Code