add remote HTTP regression tests#652
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a blocking HTTP download path with timeout and headers, a cached-local-file checker, a managed h5py handle wrapper and centralized opener, H5Reader handle-flow changes to prefer cached local artifacts, a new remote download timeout config field, and expanded tests covering these behaviors. Changes
🚥 Pre-merge checks | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b5e5786c9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| from dascore.utils.remote_io import ( | ||
| FallbackFileObj, | ||
| clear_remote_file_cache, | ||
| get_cached_local_file, |
There was a problem hiding this comment.
Remove unresolved remote_io import from test module
tests/test_utils/test_io_utils.py now imports get_cached_local_file from dascore.utils.remote_io, but that symbol is not defined anywhere in dascore/utils/remote_io.py in this commit. This causes an ImportError during pytest collection, so the entire test module fails to load before any test can run.
Useful? React with 👍 / 👎.
| monkeypatch.setattr( | ||
| "dascore.utils.hdf5.get_cached_local_file", lambda _: local_path | ||
| ) |
There was a problem hiding this comment.
Monkeypatch an existing hdf5 attribute
This test monkeypatches dascore.utils.hdf5.get_cached_local_file, but dascore/utils/hdf5.py does not define that attribute in this commit. With monkeypatch.setattr defaulting to raising=True, this line raises AttributeError and the test fails before exercising the behavior it is meant to verify.
Useful? React with 👍 / 👎.
| return response | ||
|
|
||
| monkeypatch.setattr(remote_io, "coerce_to_upath", lambda resource: resource) | ||
| monkeypatch.setattr(remote_io, "urlopen", _fake_urlopen) |
There was a problem hiding this comment.
Avoid monkeypatching missing remote_io.urlopen
remote_io does not expose a urlopen symbol in this commit, so monkeypatch.setattr(remote_io, "urlopen", _fake_urlopen) raises AttributeError (default raising=True). That makes this regression test fail immediately instead of validating HTTP download behavior.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_utils/test_io_utils.py`:
- Around line 693-729: The test fails because _download_remote_file must pass
the configured remote_download_timeout to urlopen for HTTP resources; update
remote_io._download_remote_file to use remote_io.coerce_to_upath(resource) to
get the URL string and call remote_io.urlopen(url,
timeout=get_config().remote_download_timeout) (or equivalent code that reads the
configured timeout) when resource.protocol == "http", ensuring the timeout
argument is forwarded to remote_io.urlopen and that the returned response is
read/written as before.
- Around line 565-571: The test calls a missing helper
get_cached_local_file(path); add a small wrapper in the IO utils that provides
that symbol (e.g., def get_cached_local_file(path): return
ensure_local_file(path)) so the test can import and compare the cached/local
path; reference the existing ensure_local_file implementation (call or delegate
to it) and export get_cached_local_file so tests/test_utils/test_io_utils.py can
use it.
- Line 39: The test imports get_cached_local_file from dascore.utils.remote_io
but that symbol doesn't exist; either add a correctly implemented
get_cached_local_file function to dascore.utils.remote_io (exported at module
level) or update the test to import the actual helper provided by the module
(e.g., the existing function name in dascore.utils.remote_io). Locate the import
in the test (get_cached_local_file) and either implement and expose that
function in the remote_io module (matching expected signature/behavior used by
the tests) or change the test import to the existing function name and adjust
calls accordingly.
- Around line 639-692: The test fails because _download_remote_file currently
always calls resource.open(...), but the test (and desired behavior) expects
HTTP resources (resource.protocol == "http") to use urllib-like urlopen; modify
_download_remote_file to detect when resource.protocol == "http" (after coercing
via remote_io.coerce_to_upath) and in that branch construct a
urllib.request.Request using the resource.__str__() URL and headers pulled from
resource.storage_options (map "User-Agent" -> "User-agent" if needed), call
remote_io.urlopen(request, timeout=timeout) (ensure urlopen is imported or
referenced via remote_io), read from the returned response in chunks of size
remote_download_block_size and write to the local file, supporting the context
manager protocol (with response:) and recording read sizes exactly as the test
asserts; leave non-HTTP paths to continue calling resource.open("rb",
**open_kwargs).
- Around line 256-279: Implement the missing helper and prefer local cache: add
a get_cached_local_file function in dascore.utils.remote_io that, given a remote
UPath, returns the already-materialized local Path or None; then update
H5Reader.get_handle (or the FallbackFileObj it constructs) to query
get_cached_local_file(path) first and, if a local_path is returned, open and
return that local file handle immediately instead of calling resource.open();
leave the existing fallback-to-remote logic (and is_no_range_http_error
handling) intact for cases where no cached local file exists.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2a629dc6-a252-46b0-9195-1b9badef7f7f
📒 Files selected for processing (2)
tests/test_io/test_remote_http.pytests/test_utils/test_io_utils.py
| def test_http_remote_download_uses_configured_timeout(self, monkeypatch, tmp_path): | ||
| """HTTP cache downloads should pass through the configured timeout.""" | ||
|
|
||
| class _HTTPResource: | ||
| def __init__(self): | ||
| self.protocol = "http" | ||
| self.storage_options = {} | ||
|
|
||
| def __str__(self): | ||
| return "http://example.com/data.bin" | ||
|
|
||
| class _HTTPResponse: | ||
| def read(self, _size=-1): | ||
| return b"" | ||
|
|
||
| def __enter__(self): | ||
| return self | ||
|
|
||
| def __exit__(self, *_args): | ||
| return False | ||
|
|
||
| seen = {} | ||
|
|
||
| def _fake_urlopen(request, timeout=None): | ||
| seen["url"] = request.full_url | ||
| seen["timeout"] = timeout | ||
| return _HTTPResponse() | ||
|
|
||
| monkeypatch.setattr(remote_io, "coerce_to_upath", lambda resource: resource) | ||
| monkeypatch.setattr(remote_io, "urlopen", _fake_urlopen) | ||
| with set_config(remote_download_timeout=12.5): | ||
| local_path = tmp_path / "downloaded.bin" | ||
| remote_io._download_remote_file(_HTTPResource(), local_path) | ||
|
|
||
| assert seen == {"url": "http://example.com/data.bin", "timeout": 12.5} | ||
| assert local_path.read_bytes() == b"" | ||
|
|
There was a problem hiding this comment.
Test expects HTTP-specific download timeout behavior.
Like the previous test, this verifies that HTTP downloads pass a timeout parameter to urlopen. The same concern applies: if the current implementation doesn't use urlopen, this test will fail.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_utils/test_io_utils.py` around lines 693 - 729, The test fails
because _download_remote_file must pass the configured remote_download_timeout
to urlopen for HTTP resources; update remote_io._download_remote_file to use
remote_io.coerce_to_upath(resource) to get the URL string and call
remote_io.urlopen(url, timeout=get_config().remote_download_timeout) (or
equivalent code that reads the configured timeout) when resource.protocol ==
"http", ensuring the timeout argument is forwarded to remote_io.urlopen and that
the returned response is read/written as before.
2b5e578 to
23b4f98
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## dev #652 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 134 137 +3
Lines 12042 12711 +669
==========================================
+ Hits 12042 12711 +669
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dascore/utils/hdf5.py (1)
238-244:⚠️ Potential issue | 🔴 CriticalFix decoder attribute name mismatch to avoid runtime failure.
After this rename,
decode_tablestill usesself._column_decorders(Line 306), which will raiseAttributeErroron read.🐛 Proposed fix
def decode_table(self, df): """Decode the table from hdf5.""" # ensure the base path is not in the path column - for col, func in self._column_decorders.items(): + for col, func in self._column_decoders.items(): df[col] = func(df[col])🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dascore/utils/hdf5.py` around lines 238 - 244, The code defines _column_decoders but decode_table references the misspelled self._column_decorders causing AttributeError; update decode_table to use the correct attribute name (_column_decoders) (or else rename the dict to _column_decorders to match) so that decode_table, ns_to_datetime and ns_to_timedelta mappings are used correctly (look for the decode_table method and the self._column_decorders reference and replace it with self._column_decoders).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@dascore/utils/hdf5.py`:
- Around line 238-244: The code defines _column_decoders but decode_table
references the misspelled self._column_decorders causing AttributeError; update
decode_table to use the correct attribute name (_column_decoders) (or else
rename the dict to _column_decorders to match) so that decode_table,
ns_to_datetime and ns_to_timedelta mappings are used correctly (look for the
decode_table method and the self._column_decorders reference and replace it with
self._column_decoders).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: df8f4802-a904-40b6-8c23-52863c9418ad
📒 Files selected for processing (4)
dascore/config.pydascore/utils/hdf5.pydascore/utils/remote_io.pytests/test_utils/test_io_utils.py
e3ec015 to
11a7ef8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
dascore/utils/hdf5.py (1)
180-188: Use exception chaining for clearer tracebacks.The static analysis hint is valid. When re-raising as
NotImplementedError, chain from the originalTypeErrorto preserve the diagnostic context.♻️ Proposed fix
try: _maybe_make_parent_directory(resource) return _ManagedH5pyFile(constructor(resource, mode=mode)) except TypeError: msg = f"Couldn't get handle from {resource} using h5py" - raise NotImplementedError(msg) + raise NotImplementedError(msg) from None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dascore/utils/hdf5.py` around lines 180 - 188, The except TypeError block should preserve the original exception for proper chaining: capture the TypeError as e (e.g., except TypeError as e) and re-raise the NotImplementedError using exception chaining (raise NotImplementedError(msg) from e) so the traceback includes the original TypeError raised by constructor(resource, mode=mode) when creating the _ManagedH5pyFile; leave the surrounding _maybe_make_parent_directory(resource) and return _ManagedH5pyFile(constructor(resource, mode=mode)) logic intact.dascore/utils/remote_io.py (1)
143-147: Consider handlingurlopentimeout and connection errors gracefully.The
urlopencall can raiseurllib.error.URLError,socket.timeout, orhttp.client.HTTPException. These exceptions will propagate up, but the error messages may not be user-friendly. Consider wrapping with a more descriptive exception for better diagnostics.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dascore/utils/remote_io.py` around lines 143 - 147, Wrap the urlopen call in remote_io.py (the block constructing Request(str(resource), headers=headers) and using urlopen(request, timeout=timeout) as remote_fi) with a try/except that catches urllib.error.URLError, socket.timeout, and http.client.HTTPException and re-raises a more descriptive exception (or raises a custom RuntimeError) that includes the resource URL (str(resource)) and the timeout value; ensure the new message provides clear guidance (e.g., "failed to fetch remote resource <url> within <timeout>s" or "connection error fetching <url>") while preserving the original exception as the __cause__ for debugging.tests/test_utils/test_io_utils.py (1)
1018-1042: Note: Test depends onclear_remote_file_cache()call.Line 1020 calls
clear_remote_file_cache()at the start of the test. While this works, theTestIOResourceManagerclass already has anautousefixture (clear_remote_cache) that clears the cache. This test is inTestRemoteIOFallback, which doesn't have that fixture.This is fine as-is, but if the test class grows, consider adding an autouse fixture similar to
TestIOResourceManager.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_utils/test_io_utils.py` around lines 1018 - 1042, Test calls clear_remote_file_cache() at the start of test_h5_reader_warns_when_no_range_fallback_downloads; instead add an autouse fixture to the TestRemoteIOFallback test class (mirroring TestIOResourceManager's clear_remote_cache fixture) that calls clear_remote_file_cache() before each test, and then remove the explicit clear_remote_file_cache() call from test_h5_reader_warns_when_no_range_fallback_downloads; reference TestRemoteIOFallback, clear_remote_cache fixture, and clear_remote_file_cache() when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dascore/utils/remote_io.py`:
- Around line 138-155: The branch that calls urlopen(Request(str(resource),
...)) trusts the earlier `protocol` check but should re-validate the actual URL
string to avoid unsafe schemes; before creating the Request and calling
`urlopen` (inside the branch that checks `protocol in _HTTP_PROTOCOLS`), parse
the URL from `str(resource)` (e.g. via urllib.parse.urlparse) and confirm its
parsed.scheme is one of `_HTTP_PROTOCOLS`; if not, fall back to the safe
`resource.open("rb")` path or raise a clear exception. Ensure this check happens
immediately before `Request(...)`/`urlopen(...)` so `remote_fi` is only opened
for validated HTTP/HTTPS URLs.
---
Nitpick comments:
In `@dascore/utils/hdf5.py`:
- Around line 180-188: The except TypeError block should preserve the original
exception for proper chaining: capture the TypeError as e (e.g., except
TypeError as e) and re-raise the NotImplementedError using exception chaining
(raise NotImplementedError(msg) from e) so the traceback includes the original
TypeError raised by constructor(resource, mode=mode) when creating the
_ManagedH5pyFile; leave the surrounding _maybe_make_parent_directory(resource)
and return _ManagedH5pyFile(constructor(resource, mode=mode)) logic intact.
In `@dascore/utils/remote_io.py`:
- Around line 143-147: Wrap the urlopen call in remote_io.py (the block
constructing Request(str(resource), headers=headers) and using urlopen(request,
timeout=timeout) as remote_fi) with a try/except that catches
urllib.error.URLError, socket.timeout, and http.client.HTTPException and
re-raises a more descriptive exception (or raises a custom RuntimeError) that
includes the resource URL (str(resource)) and the timeout value; ensure the new
message provides clear guidance (e.g., "failed to fetch remote resource <url>
within <timeout>s" or "connection error fetching <url>") while preserving the
original exception as the __cause__ for debugging.
In `@tests/test_utils/test_io_utils.py`:
- Around line 1018-1042: Test calls clear_remote_file_cache() at the start of
test_h5_reader_warns_when_no_range_fallback_downloads; instead add an autouse
fixture to the TestRemoteIOFallback test class (mirroring
TestIOResourceManager's clear_remote_cache fixture) that calls
clear_remote_file_cache() before each test, and then remove the explicit
clear_remote_file_cache() call from
test_h5_reader_warns_when_no_range_fallback_downloads; reference
TestRemoteIOFallback, clear_remote_cache fixture, and clear_remote_file_cache()
when making the change.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d8db24d5-df00-476e-885c-48c01b140690
📒 Files selected for processing (3)
dascore/utils/hdf5.pydascore/utils/remote_io.pytests/test_utils/test_io_utils.py
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 1 file(s) based on 1 unresolved review comment. Files modified:
Commit: The changes have been pushed to the Time taken: |
Fixed 1 file(s) based on 1 unresolved review comment. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Description
This PR fixes a race condition in the testing of http UPaths introduced in #645
I have (if applicable):
Summary by CodeRabbit
New Features
Bug Fixes / Reliability
Tests