fix: dataset: fix browsing datasets from private buckets (#957)
Conversation
📝 Walkthrough

Extends storage backend link generation to accept per-bucket credential overrides (override_url, addressing_style), threads those into dataset upload/migration and service-layer URL construction, adds manifest and file-content endpoints, and updates the frontend proxy, preview UI, hooks, tests, and mocks to use credential-aware proxied storage URLs.

Changes: credential-scoped link building → service-proxied manifest & preview
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Browser as User Browser (UI)
  participant Preview as FilePreviewPanel
  participant Frontend as Next App Proxy
  participant Service as Data Service
  participant Storage as Storage Backend
  Browser->>Preview: open preview (bucket, datasetName, file)
  Preview->>Frontend: HEAD/GET /app/proxy?bucket&name&storagePath
  Frontend->>Service: GET /{bucket}/dataset/{name}/file-content (forwards auth)
  Service->>Storage: storage client reads object using resolved credential (override_url, addressing_style)
  Storage-->>Service: object bytes + metadata
  Service-->>Frontend: proxied response with content headers
  Frontend-->>Preview: proxied response
  Preview-->>Browser: render preview using proxy URL
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Codecov Report

❌ Patch coverage report. Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #957   +/-   ##
=======================================
  Coverage   42.93%   42.93%
=======================================
  Files         218      217       -1
  Lines       28458    28489      +31
  Branches     4255     4256       +1
=======================================
+ Hits        12218    12232      +14
- Misses      15599    15615      +16
- Partials      641      642       +1
```

Flags with carried forward coverage won't be shown.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 481-487: The URL-building branch that handles override_url
currently drops any path prefix (parsed.path) and thus breaks reverse-proxied S3
endpoints; update the logic that builds the return strings (both the path-style
and virtual-host-style branches using addressing_style, scheme, host,
self.container, self.path) to include parsed.path.rstrip('/') as a base path
prefix (ensure you insert a single '/' between the base path and subsequent
segments and still call .rstrip('/') on the final URL) so that an override like
"https://gateway.example.com/s3" yields
"https://gateway.example.com/s3/<container>/<path>" (path-style) or
"https://<container>.gateway.example.com/s3/<path>" (virtual-host-style) as
appropriate.
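A minimal Python sketch of the suggested behavior; `build_link` is a hypothetical stand-in for `S3Backend.parse_uri_to_link` (the real method lives on the backend class and has a different signature), showing `parsed.path` preserved through both addressing styles:

```python
from typing import Optional
from urllib.parse import urlparse

def build_link(container: str, path: str, region: str,
               override_url: Optional[str] = None,
               addressing_style: Optional[str] = None) -> str:
    """Build a browsable HTTPS link for an object, honoring an endpoint override."""
    if override_url is None:
        # Stock AWS pattern: unchanged when no override is configured.
        return f"https://{container}.s3.{region}.amazonaws.com/{path}".rstrip('/')
    # urlparse drops a scheme-less input entirely into .path, so normalize first.
    raw = override_url if "://" in override_url else f"https://{override_url}"
    parsed = urlparse(raw)
    base = parsed.path.rstrip('/')  # e.g. '/s3' for a reverse-proxied gateway
    if addressing_style == "path":
        return f"{parsed.scheme}://{parsed.netloc}{base}/{container}/{path}".rstrip('/')
    return f"{parsed.scheme}://{container}.{parsed.netloc}{base}/{path}".rstrip('/')
```

With override_url="https://gateway.example.com/s3" this keeps the "/s3" segment in both styles, while plain AWS buckets are unaffected.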
In `@src/service/core/data/data_service.py`:
- Around line 1051-1056: The current check only compares
requested_backend.container to dataset_backend.container which allows any object
in the same container; update validation in the code that calls
storage.construct_storage_backend (where requested_backend, dataset_backend,
storage_path and dataset_info.hash_location are used) to also verify that the
requested storage_path begins with the dataset_info.hash_location prefix (or its
normalized equivalent) before allowing access, and raise the same
osmo_errors.OSMOUserError if the path is outside that prefix; ensure you
normalize/strip leading slashes or URL schemes consistently when comparing
prefixes so comparisons are robust across inputs.
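A sketch of that prefix check, using helper names of my own invention (`_normalize`, `path_within_dataset`) rather than the service's actual code:

```python
def _normalize(p: str) -> str:
    # Drop any URI scheme so 's3://bucket/ds/...' and '/bucket/ds/...'
    # compare against the same form, then strip leading slashes.
    if "://" in p:
        p = p.split("://", 1)[1]
    return p.lstrip("/")

def path_within_dataset(storage_path: str, hash_location: str) -> bool:
    """Allow access only when storage_path falls under the dataset's hash prefix."""
    # Trailing '/' on the prefix prevents 'hashes-evil' matching 'hashes'.
    prefix = _normalize(hash_location).rstrip("/") + "/"
    return _normalize(storage_path).startswith(prefix)
```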
In `@src/ui/src/app/proxy/dataset/file/route.impl.production.ts`:
- Around line 72-80: The legacy URL fallback in route.impl.production.ts (the
searchParams.get("url") branch that sets the url variable and returns it) is an
SSRF risk; either remove this fallback entirely or tighten validation: parse the
provided URL, require http(s), then enforce a strict allowlist of known storage
hostnames/domains (and optionally specific ports) and reject anything else with
a 400; if you must keep broader support also perform DNS/IP resolution and block
private/loopback/metadata IP ranges (169.254.x.x, 127.x.x.x, 10/172.16/192.168,
::1, etc.) before returning { url } so only explicitly permitted storage hosts
can be fetched.
- Around line 35-44: The proxy currently drops range semantics: update the
request-forwarding logic to include the incoming "range" header when building
the upstream request (read request.headers.get("range") and set it on the
fetch/init headers) and extend the response header passthrough so
forwardResponseHeaders (and the equivalent block at 87-100) also returns
"accept-ranges" and "content-range" (add those names to FORWARDED_HEADERS or
explicitly copy them) so 206/Range responses are preserved through the proxy.
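Sketched in Python for illustration (the actual route is TypeScript); only `accept-ranges` and `content-range` come from the review finding, the rest of the forwarding set is assumed:

```python
# Response headers the proxy passes through; the two range-related names are
# the additions that keep 206 Partial Content semantics intact.
FORWARDED_RESPONSE_HEADERS = {
    "content-type", "content-length", "etag", "last-modified",
    "accept-ranges", "content-range",
}

def upstream_request_headers(incoming: dict) -> dict:
    # Forward the client's Range header so byte-range requests reach storage.
    return {k: v for k, v in incoming.items() if k.lower() == "range"}

def forward_response_headers(upstream: dict) -> dict:
    return {k: v for k, v in upstream.items()
            if k.lower() in FORWARDED_RESPONSE_HEADERS}
```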
In `@src/ui/src/lib/api/adapter/datasets.ts`:
- Around line 510-519: fetchDatasetFiles currently imports the server-only
module "@/lib/api/server/dataset-actions.production" (and its fetchManifest)
which breaks when called from the client via useDatasetFiles; instead, replace
that server import with a client-side HTTP call to the manifest API endpoint
(e.g. fetch(`/api/bucket/${bucket}/dataset/${name}/version/${version}/manifest`)
or your equivalent route), await and parse the JSON into RawFileItem[] and then
pass it to processManifestItems; keep the early-return for null location and
ensure the fetch includes any required credentials/headers for the client
request.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2896f6e1-d95c-4993-a21e-0aa53f0ca5b3
📒 Files selected for processing (15)
- src/lib/data/dataset/migrating.py
- src/lib/data/dataset/uploading.py
- src/lib/data/storage/backends/backends.py
- src/lib/data/storage/backends/common.py
- src/lib/data/storage/backends/tests/test_backends.py
- src/service/core/data/data_service.py
- src/ui/next.config.ts
- src/ui/src/app/proxy/dataset/file/route.impl.production.ts
- src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx
- src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
- src/ui/src/lib/api/adapter/datasets-hooks.ts
- src/ui/src/lib/api/adapter/datasets.ts
- src/ui/src/lib/api/server/dataset-actions.production.ts
- src/ui/src/lib/api/server/dataset-actions.ts
- src/ui/src/mocks/handlers.ts
💤 Files with no reviewable changes (2)
- src/ui/src/lib/api/server/dataset-actions.ts
- src/ui/next.config.ts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/service/core/data/data_service.py (1)
1265-1273: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent `parse_uri_to_link` call missing override parameters.

The `query_dataset` function's metadata-enabled path calls `parse_uri_to_link` with only the `region` parameter, unlike `get_collection_info` (line 134) and `get_dataset_info` (line 212), which now pass `override_url` and `addressing_style`. This would produce incorrect URLs for private S3-compatible buckets when querying datasets.

Proposed fix for consistency:

```diff
+ default_cred = bucket_config.default_credential
+ override_url = default_cred.override_url if default_cred else None
+ addressing_style = default_cred.addressing_style if default_cred else None
+
  if query_term.metadata_enabled:
      for row in dataset_rows:
          dataset_infos.append(objects.DataInfoDatasetEntry(
              ...
              location=storage.construct_storage_backend(row.location)\
-                 .parse_uri_to_link(bucket_config.region),
+                 .parse_uri_to_link(
+                     bucket_config.region,
+                     override_url=override_url,
+                     addressing_style=addressing_style,
+                 ),
              ...
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/service/core/data/data_service.py` around lines 1265 - 1273, The metadata-enabled branch in query_dataset constructs the storage backend link using parse_uri_to_link with only bucket_config.region; update this call to pass the same override parameters as get_collection_info and get_dataset_info by including bucket_config.override_url and bucket_config.addressing_style when calling storage.construct_storage_backend(...).parse_uri_to_link so private S3-compatible buckets build correct URLs; locate the call in query_dataset and mirror the parameter order/usage from get_collection_info/get_dataset_info.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/ui/src/app/proxy/dataset/file/route.impl.production.ts`:
- Around line 36-44: The proxy currently forwards upstream cache directives by
including "cache-control" in FORWARDED_RESPONSE_HEADERS; change the proxy logic
so it no longer forwards the upstream Cache-Control (remove "cache-control" from
FORWARDED_RESPONSE_HEADERS) and instead sets the response header to
"Cache-Control: private, no-store" for this authenticated route (and any other
places using the same forwarding list). Also ensure the Vary header is not
propagated from upstream (drop or overwrite "vary") so caching cannot vary by
upstream values; update the response handling code that uses
FORWARDED_RESPONSE_HEADERS to explicitly set these headers for auth-dependent
responses (reference FORWARDED_RESPONSE_HEADERS and the response-forwarding
logic in this route implementation).
In `@src/ui/src/features/datasets/detail/components/file-preview-panel.tsx`:
- Around line 349-358: The query that fetches HEAD results (useQuery with
queryKey ["file-preview-head", proxyUrl] and queryFn -> fetchHeadResult)
currently sets staleTime: Infinity which caches auth-sensitive 401/403 responses
forever; change staleTime to 0 so results are revalidated against current auth
headers (keep enabled: !!previewSource and the same
queryKey/previewSource/fetchHeadResult logic intact).
---
Outside diff comments:
In `@src/service/core/data/data_service.py`:
- Around line 1265-1273: The metadata-enabled branch in query_dataset constructs
the storage backend link using parse_uri_to_link with only bucket_config.region;
update this call to pass the same override parameters as get_collection_info and
get_dataset_info by including bucket_config.override_url and
bucket_config.addressing_style when calling
storage.construct_storage_backend(...).parse_uri_to_link so private
S3-compatible buckets build correct URLs; locate the call in query_dataset and
mirror the parameter order/usage from get_collection_info/get_dataset_info.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: af547cc3-7f82-4d7c-801b-cb373443e8c1
📒 Files selected for processing (5)
- src/lib/data/storage/backends/backends.py
- src/lib/data/storage/backends/tests/test_backends.py
- src/service/core/data/data_service.py
- src/ui/src/app/proxy/dataset/file/route.impl.production.ts
- src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
…le backends
S3Backend.parse_uri_to_link hardcoded the AWS pattern
'https://<bucket>.s3.<region>.amazonaws.com/...' regardless of credential
override_url. The UI consumes these URLs to fetch dataset content; for a
CAIOS-backed bucket this produced
'osmo-on-cw-dev-harnholm-datasets.s3.us-east-14a.amazonaws.com', which
ENOTFOUNDs in DNS.
Thread the credential's override_url and addressing_style through
parse_uri_to_link. When override_url is set, S3Backend builds a URL
against that host (virtual-host by default; addressing_style='path'
yields path-style for localstack/MinIO without wildcard DNS). AWS S3
behavior (no override_url) is unchanged. Other backends accept the
new kwargs but ignore them.
Updated callsites:
- data_service.py: dataset listing payload now respects bucket
config's default_credential.override_url. This is the immediate
UI fix.
- dataset/uploading.py (×2) + migrating.py: new uploads/migrations
record the correct URL in the manifest. Existing manifest entries
persisted with AWS-pattern URLs are not migrated (separate concern).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t private buckets (#793)

The UI's fetchManifest and file preview proxy performed unsigned fetch() against S3 HTTPS URLs, which fails with 403 on private buckets. Added two service-side proxy endpoints:
- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file content with storage_path validation against dataset container

Updated UI to call these endpoints through the existing /api catch-all proxy instead of direct unsigned fetch. Removed unused fetchManifest server action files.
The file preview panel sends a HEAD request before GET to check content-type and access. FastAPI's @router.get does not handle HEAD, returning 405 Method Not Allowed. Changed to @router.api_route with both GET and HEAD methods.
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.
Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
…oints accepted no positional args. On main, the public overloads now require a remote_path argument. Single-object fetches should go through storage.SingleObjectClient (already used by cli/dataset.py and lib/data/dataset/downloading.py) — its get_object_stream() takes no path because the path is bound at create() time from storage_uri. mypy on data_service.py now passes; behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
The server-action call to /api/bucket/{bucket}/dataset/{name}/manifest
went out without the incoming request's session cookie, so oauth2-proxy
returned 401 with "No valid authentication in request" and the dataset
detail page crashed with "Failed to fetch manifest: 401".
getServerFetchHeaders() already exists in config.production.ts for
exactly this purpose — it pulls authorization + cookie from the SSR
request via next/headers and forwards them upstream. Use it.
The original docstring claimed the server action would "bypass the API
gateway auth layer" by hitting an internal service URL, but
getServerApiBaseUrl() returns the public hostname (Envoy + oauth2-proxy
front), so requests still pass through the gateway. Forwarding the
caller's session is the simpler fix and doesn't require a new internal
URL knob in deployment config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
…uessing

storage_path is hash-keyed (e.g. .../hashes/<etag>) and carries no extension. mimetypes.guess_type(storage_path) returned None, the service fell back to application/octet-stream, and the preview panel rendered every text file as a binary download.

Forward the original relative_path through the proxy as a 'filename' query param. The service uses it for media-type guessing only; access control still hinges on storage_path's container check. Falls through cleanly when filename is absent (preserves current behavior for any legacy caller).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
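The guessing order described in that commit can be sketched as follows; `resolve_media_type` is an illustrative helper, not the service's actual function:

```python
import mimetypes
from typing import Optional

def resolve_media_type(storage_path: str, filename: Optional[str] = None) -> str:
    """Prefer the human-readable filename for guessing; fall back to storage_path."""
    # Hash-keyed storage paths (e.g. .../hashes/<etag>) carry no extension,
    # so guessing from them alone always ends in application/octet-stream.
    for candidate in (filename, storage_path):
        if candidate:
            guessed, _ = mimetypes.guess_type(candidate)
            if guessed:
                return guessed
    return "application/octet-stream"
```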
The conflict resolution when cherry-picking #795 left a multi-line generateFlatManifest call that prettier wants on one line under the project's print-width settings. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Four review-driven fixes that all touched the dataset preview path:

1. parse_uri_to_link preserves override_url base path (CAIOS via reverse proxy). Previously dropped parsed.path; an override of 'https://gateway.example.com/s3' produced URLs against gateway.example.com with no /s3 segment. Now host + base_path are preserved through both addressing styles, including the no-scheme-prefix case where urlparse drops the whole input into 'path'.
2. get_file_content validates against dataset hash_location prefix, not just the bucket container. This was the unresolved CodeRabbit feedback on #795 — container-only matching let an authenticated caller request any object in the same bucket via the dataset endpoint.
3. Forward Range request and accept-ranges/content-range response headers through the proxy so 206 Partial Content semantics survive (needed for video/large-file preview).
4. Remove the legacy 'url=' fallback from the proxy. It accepted any http(s) URL — a clear SSRF vector (cloud metadata endpoints, RFC1918, loopback). Manifests since #795 always carry storage_path; the legacy path was effectively dead. UI's file-preview-panel updated to drop the matching client-side fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
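Fix 4 removes the fallback outright; for reference, the kind of validation the review asked for if such a fallback were kept could look like this (the allowlist contents and helper names are illustrative, not project code):

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of known storage hosts; a real deployment would
# source this from configuration.
ALLOWED_STORAGE_HOSTS = {"storage.example.com", "gateway.example.com"}

def is_private_address(host: str) -> bool:
    # Block loopback, RFC1918, link-local (incl. 169.254.169.254 metadata), etc.
    try:
        return not ipaddress.ip_address(host).is_global
    except ValueError:
        return False  # not an IP literal; the allowlist decides

def is_allowed_storage_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or parsed.hostname is None:
        return False
    if is_private_address(parsed.hostname):
        return False
    return parsed.hostname in ALLOWED_STORAGE_HOSTS
```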
Two review fixes for the auth-sensitive dataset preview path:
1. Stop forwarding upstream Cache-Control on /proxy/dataset/file. Storage
providers commonly set 'public, max-age=…' on object responses; passing
that through on a per-user authenticated route lets intermediate caches
serve one user's bytes to another. Drop 'cache-control' from
FORWARDED_RESPONSE_HEADERS and explicitly set 'Cache-Control:
private, no-store' on every response. ('vary' was never forwarded — left
as is.)
2. Drop staleTime: Infinity from the file-preview HEAD useQuery. Caching a
401/403 forever would prevent revalidation after the user's session
refreshes. staleTime: 0 restores standard React Query revalidation
semantics (refetch on focus/mount when stale).
The text-preview useQuery one block up still has staleTime: Infinity; left
alone here since the reviewer scoped the finding to the HEAD query, but
worth a follow-up if we see stuck 401s on the text preview specifically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Force-pushed 2d13ce3 to b9e76a1
…set link
query_dataset's metadata-enabled branch was the third parse_uri_to_link
callsite I missed when threading override_url + addressing_style for
private S3-compatible buckets — get_collection_info and get_dataset_info
already do this. Without it, /api/bucket/{bucket}/query results
on a CAIOS-backed bucket return AWS-pattern hostnames that ENOTFOUND on
the UI side, same failure mode the original parse_uri_to_link bug had.
Mirrors the default_credential fallback pattern (override_url and
addressing_style live on bucket_config.default_credential, not directly
on bucket_config).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
🧹 Nitpick comments (1)
src/service/core/data/data_service.py (1)
1009-1035: 💤 Low value. Consider adding a more specific return type annotation.

The return type `-> List` is quite generic. For better type safety and API documentation, consider specifying the manifest entry structure:

```diff
-) -> List:
+) -> List[Dict[str, Any]]:
```

Or define a Pydantic model for manifest entries if the structure is well-defined.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/service/core/data/data_service.py` around lines 1009 - 1035, the function get_manifest currently uses a generic return annotation -> List; change it to a specific type (e.g., List[ManifestEntry] or List[Dict[str, Any]]) to improve type safety and docs: define a Pydantic model ManifestEntry (subclass pydantic.BaseModel) capturing the manifest fields or choose typing.List[Dict[str, Any]] if structure is dynamic, update the function signature from def get_manifest(...) -> List to def get_manifest(...) -> List[ManifestEntry] (or List[Dict[str, Any]]), and replace the raw json.loads(manifest_content) return with parsing to the chosen type (e.g., pydantic.parse_obj_as(List[ManifestEntry], json.loads(...)) or cast to List[Dict[str, Any]]); add the necessary imports for typing or pydantic and keep existing logic in get_manifest unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/service/core/data/data_service.py`:
- Around line 1009-1035: The function get_manifest currently uses a generic
return annotation -> List; change it to a specific type (e.g.,
List[ManifestEntry] or List[Dict[str, Any]]) to improve type safety and docs:
define a Pydantic model ManifestEntry (subclass pydantic.BaseModel) capturing
the manifest fields or choose typing.List[Dict[str, Any]] if structure is
dynamic, update the function signature from def get_manifest(...) -> List to def
get_manifest(...) -> List[ManifestEntry] (or List[Dict[str, Any]]), and replace
the raw json.loads(manifest_content) return with parsing to the chosen type
(e.g., pydantic.parse_obj_as(List[ManifestEntry], json.loads(...)) or cast to
List[Dict[str, Any]]); add the necessary imports for typing or pydantic and keep
existing logic in get_manifest unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a0f768d9-9261-4c0e-8ddf-e321780c6e62
📒 Files selected for processing (1)
src/service/core/data/data_service.py
Description
Completes the CAIOS / private-S3-compatible-bucket unblock work started in #950, focused on the UI side. After #950, the CLI and workflow runtime worked against CAIOS, but the dataset detail page in the UI was still unusable. This PR fixes the three distinct failures that surfaced — URL construction, gateway auth on SSR fetches, and the unsigned-fetch-against-private-bucket pattern that #795 was already addressing.
Issue #950
Checklist
Summary by CodeRabbit
New Features
Refactor
Tests / Mocks
Chores