add remote HTTP regression tests #652

Merged
d-chambers merged 4 commits into dev from fix_https_race_condition
Apr 6, 2026

Contributor

@d-chambers d-chambers commented Apr 4, 2026

Description

This PR fixes a race condition in the testing of HTTP UPaths introduced in #645.

I have (if applicable):

  • referenced the GitHub issue this PR closes.
  • documented the new feature with docstrings and/or appropriate doc page.
  • included tests. See testing guidelines.
  • added the "ready_for_review" tag once the PR is ready to be reviewed.

Summary by CodeRabbit

  • New Features

    • Added a configurable timeout for blocking remote file downloads (default 60s).
  • Bug Fixes / Reliability

    • HTTP downloads now stream directly, honor request headers and the configured timeout, and atomically update the cache.
    • Reader handles now wrap and deterministically manage underlying file-like resources and prefer existing local cached copies when available.
  • Tests

    • Expanded tests for remote IO, caching behavior, reader handle lifecycle, and fallback semantics.
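
The streaming, header-forwarding, and atomic cache update described in this summary can be sketched with the standard library alone. This is a minimal illustration, not DASCore's implementation: the helper name `download_atomically`, its parameters, and the temporary-file strategy are all assumptions.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def download_atomically(url, local_path, headers=None, timeout=60.0, block_size=1 << 20):
    """Hypothetical helper: stream `url` into `local_path` in chunks.

    The data is written to a temporary file in the same directory and then
    moved into place with os.replace, so readers never see a partial file.
    """
    local_path = Path(local_path)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    request = Request(url, headers=dict(headers or {}))
    fd, tmp_name = tempfile.mkstemp(dir=local_path.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as local_fi, urlopen(request, timeout=timeout) as remote_fi:
            # Read fixed-size chunks until the response is exhausted.
            while chunk := remote_fi.read(block_size):
                local_fi.write(chunk)
        # Atomic when source and destination share a filesystem.
        os.replace(tmp_name, local_path)
    except BaseException:
        # Never leave a partial download behind.
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return local_path
```

Writing to a sibling temporary file first means a concurrent reader sees either the old cache entry or the complete new one, which is one common way to avoid this class of race.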

Contributor

coderabbitai Bot commented Apr 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a blocking HTTP download path with timeout and headers, a cached-local-file checker, a managed h5py handle wrapper and centralized opener, H5Reader handle-flow changes to prefer cached local artifacts, a new remote download timeout config field, and expanded tests covering these behaviors.
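
As a rough sketch of what a managed handle wrapper can look like, the proxy below forwards attribute access to a wrapped handle and closes both the handle and the file-like object backing it. The names `ManagedHandle` and `_FakeHandle` are hypothetical; `_FakeHandle` only stands in for an `h5py.File` opened on a file-like object, and none of this is the PR's actual `_ManagedH5pyFile` code.

```python
import io


class _FakeHandle:
    """Stand-in for a handle (such as h5py.File) that does not close its fileobj."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.closed = False

    def read_signature(self):
        return self._fileobj.getvalue()[:4]

    def close(self):
        self.closed = True


class ManagedHandle:
    """Hypothetical proxy that owns the lifecycle of a wrapped file-like object."""

    def __init__(self, handle, owned_fileobj=None):
        self._handle = handle
        self._owned_fileobj = owned_fileobj

    def __getattr__(self, name):
        # Only reached when normal lookup fails; never proxy private names,
        # which also avoids recursion if _handle is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._handle, name)

    def close(self):
        try:
            self._handle.close()
        finally:
            # Close the backing file object even if the handle's close fails.
            if self._owned_fileobj is not None:
                self._owned_fileobj.close()
                self._owned_fileobj = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
        return False


raw = io.BytesIO(b"\x89HDF\r\n\x1a\n")
with ManagedHandle(_FakeHandle(raw), owned_fileobj=raw) as handle:
    signature = handle.read_signature()  # proxied to the wrapped handle
```

After the `with` block exits, both the wrapped handle and `raw` are closed, so the caller does not have to track the backing file object separately.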

Changes

| Cohort / File(s) | Summary |
|---|---|
| Remote IO & caching: dascore/utils/remote_io.py | Adds _get_cached_local_file(resource) to return existing cached artifacts without materializing; implements blocking urllib.request.urlopen-based download for http/https that streams into the cache using get_config().remote_download_timeout, forwards headers from storage_options, and retains prior non-HTTP streaming. Renames FallbackFileObj → _FallbackFileObj and clarifies fallback behavior/comments. |
| HDF5 handling & readers: dascore/utils/hdf5.py | Introduces _ManagedH5pyFile proxy to own/close wrapped fileobj lifecycles and centralizes construction via open_h5_resource(...) which handles passthroughs, IOBase/fileobj, and UPath (preferring cached local file when available). Updates H5Reader.get_handle to use open_h5_resource and accepts _ManagedH5pyFile fast-path. Renames decoder attr _column_decorders → _column_decoders. |
| Configuration: dascore/config.py | Adds remote_download_timeout: float = Field(default=60.0, description="Timeout in seconds for blocking remote file downloads.") to DascoreConfig. |
| Tests: tests/test_utils/test_io_utils.py, tests/test_io/test_remote_http.py | Expands tests to cover open_h5_resource, _ManagedH5pyFile lifecycle and context-manager behavior, _get_cached_local_file cache preference, HTTP download behavior (uses urlopen, headers, timeout, chunked reads), and adjusts test_remote_http assertions to capture format and read outputs into locals. |
🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title 'add remote HTTP regression tests' is vague and does not accurately describe the main changes, which include configuration timeouts, HDF5 resource management, remote download improvements, and fixing a race condition. | Change the title to better reflect the primary change, such as 'Fix HTTP race condition with remote timeout and managed H5 resource handling' or similar to capture the key improvements. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 63.64%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description mentions fixing a race condition in HTTP UPath testing but lacks detail about the substantial changes to configuration, HDF5 resource management, remote downloads, and test improvements across multiple files. | Expand the description to explain the race condition, the timeout configuration added, the H5 resource wrapper, and remote download improvements made to fix it. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b5e5786c9

Comment thread tests/test_utils/test_io_utils.py Outdated
from dascore.utils.remote_io import (
    FallbackFileObj,
    clear_remote_file_cache,
    get_cached_local_file,

P1: Remove unresolved remote_io import from test module

tests/test_utils/test_io_utils.py now imports get_cached_local_file from dascore.utils.remote_io, but that symbol is not defined anywhere in dascore/utils/remote_io.py in this commit. This causes an ImportError during pytest collection, so the entire test module fails to load before any test can run.

Comment on lines +263 to +265
monkeypatch.setattr(
    "dascore.utils.hdf5.get_cached_local_file", lambda _: local_path
)

P1: Monkeypatch an existing hdf5 attribute

This test monkeypatches dascore.utils.hdf5.get_cached_local_file, but dascore/utils/hdf5.py does not define that attribute in this commit. With monkeypatch.setattr defaulting to raising=True, this line raises AttributeError and the test fails before exercising the behavior it is meant to verify.

return response

monkeypatch.setattr(remote_io, "coerce_to_upath", lambda resource: resource)
monkeypatch.setattr(remote_io, "urlopen", _fake_urlopen)

P1: Avoid monkeypatching missing remote_io.urlopen

remote_io does not expose a urlopen symbol in this commit, so monkeypatch.setattr(remote_io, "urlopen", _fake_urlopen) raises AttributeError (default raising=True). That makes this regression test fail immediately instead of validating HTTP download behavior.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_utils/test_io_utils.py`:
- Around line 693-729: The test fails because _download_remote_file must pass
the configured remote_download_timeout to urlopen for HTTP resources; update
remote_io._download_remote_file to use remote_io.coerce_to_upath(resource) to
get the URL string and call remote_io.urlopen(url,
timeout=get_config().remote_download_timeout) (or equivalent code that reads the
configured timeout) when resource.protocol == "http", ensuring the timeout
argument is forwarded to remote_io.urlopen and that the returned response is
read/written as before.
- Around line 565-571: The test calls a missing helper
get_cached_local_file(path); add a small wrapper in the IO utils that provides
that symbol (e.g., def get_cached_local_file(path): return
ensure_local_file(path)) so the test can import and compare the cached/local
path; reference the existing ensure_local_file implementation (call or delegate
to it) and export get_cached_local_file so tests/test_utils/test_io_utils.py can
use it.
- Line 39: The test imports get_cached_local_file from dascore.utils.remote_io
but that symbol doesn't exist; either add a correctly implemented
get_cached_local_file function to dascore.utils.remote_io (exported at module
level) or update the test to import the actual helper provided by the module
(e.g., the existing function name in dascore.utils.remote_io). Locate the import
in the test (get_cached_local_file) and either implement and expose that
function in the remote_io module (matching expected signature/behavior used by
the tests) or change the test import to the existing function name and adjust
calls accordingly.
- Around line 639-692: The test fails because _download_remote_file currently
always calls resource.open(...), but the test (and desired behavior) expects
HTTP resources (resource.protocol == "http") to use urllib-like urlopen; modify
_download_remote_file to detect when resource.protocol == "http" (after coercing
via remote_io.coerce_to_upath) and in that branch construct a
urllib.request.Request using the resource.__str__() URL and headers pulled from
resource.storage_options (map "User-Agent" -> "User-agent" if needed), call
remote_io.urlopen(request, timeout=timeout) (ensure urlopen is imported or
referenced via remote_io), read from the returned response in chunks of size
remote_download_block_size and write to the local file, supporting the context
manager protocol (with response:) and recording read sizes exactly as the test
asserts; leave non-HTTP paths to continue calling resource.open("rb",
**open_kwargs).
- Around line 256-279: Implement the missing helper and prefer local cache: add
a get_cached_local_file function in dascore.utils.remote_io that, given a remote
UPath, returns the already-materialized local Path or None; then update
H5Reader.get_handle (or the FallbackFileObj it constructs) to query
get_cached_local_file(path) first and, if a local_path is returned, open and
return that local file handle immediately instead of calling resource.open();
leave the existing fallback-to-remote logic (and is_no_range_http_error
handling) intact for cases where no cached local file exists.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2a629dc6-a252-46b0-9195-1b9badef7f7f

📥 Commits

Reviewing files that changed from the base of the PR and between 35d6b82 and 2b5e578.

📒 Files selected for processing (2)
  • tests/test_io/test_remote_http.py
  • tests/test_utils/test_io_utils.py

Comment on lines +693 to +729
    def test_http_remote_download_uses_configured_timeout(self, monkeypatch, tmp_path):
        """HTTP cache downloads should pass through the configured timeout."""

        class _HTTPResource:
            def __init__(self):
                self.protocol = "http"
                self.storage_options = {}

            def __str__(self):
                return "http://example.com/data.bin"

        class _HTTPResponse:
            def read(self, _size=-1):
                return b""

            def __enter__(self):
                return self

            def __exit__(self, *_args):
                return False

        seen = {}

        def _fake_urlopen(request, timeout=None):
            seen["url"] = request.full_url
            seen["timeout"] = timeout
            return _HTTPResponse()

        monkeypatch.setattr(remote_io, "coerce_to_upath", lambda resource: resource)
        monkeypatch.setattr(remote_io, "urlopen", _fake_urlopen)
        with set_config(remote_download_timeout=12.5):
            local_path = tmp_path / "downloaded.bin"
            remote_io._download_remote_file(_HTTPResource(), local_path)

        assert seen == {"url": "http://example.com/data.bin", "timeout": 12.5}
        assert local_path.read_bytes() == b""

⚠️ Potential issue | 🔴 Critical

Test expects HTTP-specific download timeout behavior.

Like the previous test, this verifies that HTTP downloads pass a timeout parameter to urlopen. The same concern applies: if the current implementation doesn't use urlopen, this test will fail.

@d-chambers d-chambers force-pushed the fix_https_race_condition branch from 2b5e578 to 23b4f98 Compare April 4, 2026 15:45

codecov Bot commented Apr 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (e656799) to head (30aa026).
⚠️ Report is 3 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##               dev      #652    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          134       137     +3     
  Lines        12042     12711   +669     
==========================================
+ Hits         12042     12711   +669     
Flag Coverage Δ
unittests 100.00% <100.00%> (ø)

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dascore/utils/hdf5.py (1)

238-244: ⚠️ Potential issue | 🔴 Critical

Fix decoder attribute name mismatch to avoid runtime failure.

After this rename, decode_table still uses self._column_decorders (Line 306), which will raise AttributeError on read.

🐛 Proposed fix
 def decode_table(self, df):
     """Decode the table from hdf5."""
     # ensure the base path is not in the path column
-    for col, func in self._column_decorders.items():
+    for col, func in self._column_decoders.items():
         df[col] = func(df[col])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dascore/utils/hdf5.py` around lines 238 - 244, The code defines
_column_decoders but decode_table references the misspelled
self._column_decorders causing AttributeError; update decode_table to use the
correct attribute name (_column_decoders) (or else rename the dict to
_column_decorders to match) so that decode_table, ns_to_datetime and
ns_to_timedelta mappings are used correctly (look for the decode_table method
and the self._column_decorders reference and replace it with
self._column_decoders).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: df8f4802-a904-40b6-8c23-52863c9418ad

📥 Commits

Reviewing files that changed from the base of the PR and between 23b4f98 and e3ec015.

📒 Files selected for processing (4)
  • dascore/config.py
  • dascore/utils/hdf5.py
  • dascore/utils/remote_io.py
  • tests/test_utils/test_io_utils.py

@d-chambers d-chambers force-pushed the fix_https_race_condition branch from e3ec015 to 11a7ef8 Compare April 4, 2026 19:10
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
dascore/utils/hdf5.py (1)

180-188: Use exception chaining for clearer tracebacks.

The static analysis hint is valid. When re-raising as NotImplementedError, chain from the original TypeError to preserve the diagnostic context.

♻️ Proposed fix
     try:
         _maybe_make_parent_directory(resource)
         return _ManagedH5pyFile(constructor(resource, mode=mode))
-    except TypeError:
+    except TypeError as e:
         msg = f"Couldn't get handle from {resource} using h5py"
-        raise NotImplementedError(msg)
+        raise NotImplementedError(msg) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dascore/utils/hdf5.py` around lines 180 - 188, The except TypeError block
should preserve the original exception for proper chaining: capture the
TypeError as e (e.g., except TypeError as e) and re-raise the
NotImplementedError using exception chaining (raise NotImplementedError(msg)
from e) so the traceback includes the original TypeError raised by
constructor(resource, mode=mode) when creating the _ManagedH5pyFile; leave the
surrounding _maybe_make_parent_directory(resource) and return
_ManagedH5pyFile(constructor(resource, mode=mode)) logic intact.
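
A small standalone sketch shows the effect of explicit exception chaining. The function names `open_handle` and `_picky_constructor` are made up for this illustration and are not part of DASCore.

```python
def open_handle(constructor, resource, mode="r"):
    """Hypothetical opener that converts TypeError into NotImplementedError."""
    try:
        return constructor(resource, mode=mode)
    except TypeError as e:
        msg = f"Couldn't get handle from {resource!r}"
        # `from e` records the TypeError as __cause__ so it shows in the traceback.
        raise NotImplementedError(msg) from e


def _picky_constructor(resource, mode="r"):
    """Stand-in constructor that rejects every resource."""
    raise TypeError("unsupported resource type")


try:
    open_handle(_picky_constructor, 42)
except NotImplementedError as err:
    cause = err.__cause__  # the original TypeError
```

With `from e`, the traceback prints the original `TypeError` followed by "The above exception was the direct cause of the following exception", whereas `from None` would suppress it.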
dascore/utils/remote_io.py (1)

143-147: Consider handling urlopen timeout and connection errors gracefully.

The urlopen call can raise urllib.error.URLError, socket.timeout, or http.client.HTTPException. These exceptions will propagate up, but the error messages may not be user-friendly. Consider wrapping with a more descriptive exception for better diagnostics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dascore/utils/remote_io.py` around lines 143 - 147, Wrap the urlopen call in
remote_io.py (the block constructing Request(str(resource), headers=headers) and
using urlopen(request, timeout=timeout) as remote_fi) with a try/except that
catches urllib.error.URLError, socket.timeout, and http.client.HTTPException and
re-raises a more descriptive exception (or raises a custom RuntimeError) that
includes the resource URL (str(resource)) and the timeout value; ensure the new
message provides clear guidance (e.g., "failed to fetch remote resource <url>
within <timeout>s" or "connection error fetching <url>") while preserving the
original exception as the __cause__ for debugging.
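
One possible shape for this wrapping, using only the standard library, is sketched below. `RemoteDownloadError` and `fetch_bytes` are hypothetical names chosen for the example; the PR itself may handle these errors differently or not at all.

```python
import http.client
import socket
import urllib.error
from urllib.request import Request, urlopen


class RemoteDownloadError(RuntimeError):
    """Hypothetical error type carrying the URL and timeout in its message."""


def fetch_bytes(url, headers=None, timeout=60.0):
    """Read the full body of `url`, re-raising low-level errors with context."""
    request = Request(url, headers=dict(headers or {}))
    try:
        with urlopen(request, timeout=timeout) as remote_fi:
            return remote_fi.read()
    except socket.timeout as e:
        # Raised directly when a read on an open connection times out.
        msg = f"Timed out fetching remote resource {url} after {timeout}s"
        raise RemoteDownloadError(msg) from e
    except (urllib.error.URLError, http.client.HTTPException) as e:
        # URLError also covers HTTPError and most connection failures.
        msg = f"Connection error fetching {url}: {e}"
        raise RemoteDownloadError(msg) from e
```

The original exception stays available as `__cause__`, so the friendlier message does not hide the underlying diagnostic.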
tests/test_utils/test_io_utils.py (1)

1018-1042: Note: Test depends on clear_remote_file_cache() call.

Line 1020 calls clear_remote_file_cache() at the start of the test. While this works, the TestIOResourceManager class already has an autouse fixture (clear_remote_cache) that clears the cache. This test is in TestRemoteIOFallback, which doesn't have that fixture.

This is fine as-is, but if the test class grows, consider adding an autouse fixture similar to TestIOResourceManager.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_utils/test_io_utils.py` around lines 1018 - 1042, Test calls
clear_remote_file_cache() at the start of
test_h5_reader_warns_when_no_range_fallback_downloads; instead add an autouse
fixture to the TestRemoteIOFallback test class (mirroring
TestIOResourceManager's clear_remote_cache fixture) that calls
clear_remote_file_cache() before each test, and then remove the explicit
clear_remote_file_cache() call from
test_h5_reader_warns_when_no_range_fallback_downloads; reference
TestRemoteIOFallback, clear_remote_cache fixture, and clear_remote_file_cache()
when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dascore/utils/remote_io.py`:
- Around line 138-155: The branch that calls urlopen(Request(str(resource),
...)) trusts the earlier `protocol` check but should re-validate the actual URL
string to avoid unsafe schemes; before creating the Request and calling
`urlopen` (inside the branch that checks `protocol in _HTTP_PROTOCOLS`), parse
the URL from `str(resource)` (e.g. via urllib.parse.urlparse) and confirm its
parsed.scheme is one of `_HTTP_PROTOCOLS`; if not, fall back to the safe
`resource.open("rb")` path or raise a clear exception. Ensure this check happens
immediately before `Request(...)`/`urlopen(...)` so `remote_fi` is only opened
for validated HTTP/HTTPS URLs.
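
The scheme re-validation suggested here could look roughly like the following. The contents of `_HTTP_PROTOCOLS` and the helper name `is_http_url` are assumptions made for this sketch; only the constant's name appears in the review comment.

```python
from urllib.parse import urlparse

# Assumed contents; the real constant in remote_io.py may differ.
_HTTP_PROTOCOLS = frozenset({"http", "https"})


def is_http_url(url):
    """Return True only if the parsed URL scheme is http or https."""
    return urlparse(str(url)).scheme in _HTTP_PROTOCOLS


checks = {
    "https://example.com/data.h5": is_http_url("https://example.com/data.h5"),
    "file:///etc/passwd": is_http_url("file:///etc/passwd"),
    "ftp://example.com/data.h5": is_http_url("ftp://example.com/data.h5"),
}
```

Checking the parsed scheme immediately before calling `urlopen` matters because `urlopen` also accepts schemes such as `file:` and `ftp:`, which a caller may not intend to allow.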

🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d8db24d5-df00-476e-885c-48c01b140690

📥 Commits

Reviewing files that changed from the base of the PR and between 11a7ef8 and 3b0e4b0.

📒 Files selected for processing (3)
  • dascore/utils/hdf5.py
  • dascore/utils/remote_io.py
  • tests/test_utils/test_io_utils.py

Contributor

coderabbitai Bot commented Apr 6, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 1 file(s) based on 1 unresolved review comment.

Files modified:

  • dascore/utils/remote_io.py

Commit: d10902e20f07b6346c84003c4c597c39ca916cef

The changes have been pushed to the fix_https_race_condition branch.

Time taken: 2m 38s

coderabbitai Bot and others added 2 commits April 6, 2026 05:30
Fixed 1 file(s) based on 1 unresolved review comment.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
@d-chambers d-chambers merged commit 7224df8 into dev Apr 6, 2026
25 checks passed
@d-chambers d-chambers deleted the fix_https_race_condition branch April 6, 2026 06:39