fix: dataset: fix browsing datasets from private buckets (#957)
Conversation
📝 Walkthrough

Extends storage backend link generation to accept per-bucket credential overrides (override_url, addressing_style), threads those into dataset upload/migration and service-layer URL construction, adds manifest and file-content endpoints, and updates the frontend proxy, preview UI, hooks, tests, and mocks to use credential-aware proxied storage URLs.

Changes: credential-scoped link building → service-proxied manifest & preview
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Browser as User Browser (UI)
  participant Preview as FilePreviewPanel
  participant Frontend as Next App Proxy
  participant Service as Data Service
  participant Storage as Storage Backend
  Browser->>Preview: open preview (bucket, datasetName, file)
  Preview->>Frontend: HEAD/GET /app/proxy?bucket&name&storagePath
  Frontend->>Service: GET /{bucket}/dataset/{name}/file-content (forwards auth)
  Service->>Storage: storage client reads object using resolved credential (override_url, addressing_style)
  Storage-->>Service: object bytes + metadata
  Service-->>Frontend: proxied response with content headers
  Frontend-->>Preview: proxied response
  Preview-->>Browser: render preview using proxy URL
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Codecov Report

❌ Patch coverage report. Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #957   +/-   ##
=======================================
  Coverage   42.93%   42.93%
=======================================
  Files         218      217       -1
  Lines       28458    28489      +31
  Branches     4255     4256       +1
=======================================
+ Hits        12218    12232      +14
- Misses      15599    15615      +16
- Partials      641      642       +1
```

Flags with carried forward coverage won't be shown.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 481-487: The URL-building branch that handles override_url
currently drops any path prefix (parsed.path) and thus breaks reverse-proxied S3
endpoints; update the logic that builds the return strings (both the path-style
and virtual-host-style branches using addressing_style, scheme, host,
self.container, self.path) to include parsed.path.rstrip('/') as a base path
prefix (ensure you insert a single '/' between the base path and subsequent
segments and still call .rstrip('/') on the final URL) so that an override like
"https://gateway.example.com/s3" yields
"https://gateway.example.com/s3/<container>/<path>" (path-style) or
"https://<container>.gateway.example.com/s3/<path>" (virtual-host-style) as
appropriate.
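A minimal Python sketch of the suggested behavior; `build_link` is a hypothetical stand-in for `S3Backend.parse_uri_to_link` (the real method lives on the backend class and has a different signature), showing `parsed.path` preserved through both addressing styles:

```python
from typing import Optional
from urllib.parse import urlparse

def build_link(container: str, path: str, region: str,
               override_url: Optional[str] = None,
               addressing_style: Optional[str] = None) -> str:
    """Build a browsable HTTPS link for an object, honoring an endpoint override."""
    if override_url is None:
        # Stock AWS pattern: unchanged when no override is configured.
        return f"https://{container}.s3.{region}.amazonaws.com/{path}".rstrip('/')
    # urlparse drops a scheme-less input entirely into .path, so normalize first.
    raw = override_url if "://" in override_url else f"https://{override_url}"
    parsed = urlparse(raw)
    base = parsed.path.rstrip('/')  # e.g. '/s3' for a reverse-proxied gateway
    if addressing_style == "path":
        return f"{parsed.scheme}://{parsed.netloc}{base}/{container}/{path}".rstrip('/')
    return f"{parsed.scheme}://{container}.{parsed.netloc}{base}/{path}".rstrip('/')
```

With override_url="https://gateway.example.com/s3" this keeps the "/s3" segment in both styles, while plain AWS buckets are unaffected.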
In `@src/service/core/data/data_service.py`:
- Around line 1051-1056: The current check only compares
requested_backend.container to dataset_backend.container which allows any object
in the same container; update validation in the code that calls
storage.construct_storage_backend (where requested_backend, dataset_backend,
storage_path and dataset_info.hash_location are used) to also verify that the
requested storage_path begins with the dataset_info.hash_location prefix (or its
normalized equivalent) before allowing access, and raise the same
osmo_errors.OSMOUserError if the path is outside that prefix; ensure you
normalize/strip leading slashes or URL schemes consistently when comparing
prefixes so comparisons are robust across inputs.
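A sketch of that prefix check, using helper names of my own invention (`_normalize`, `path_within_dataset`) rather than the service's actual code:

```python
def _normalize(p: str) -> str:
    # Drop any URI scheme so 's3://bucket/ds/...' and '/bucket/ds/...'
    # compare against the same form, then strip leading slashes.
    if "://" in p:
        p = p.split("://", 1)[1]
    return p.lstrip("/")

def path_within_dataset(storage_path: str, hash_location: str) -> bool:
    """Allow access only when storage_path falls under the dataset's hash prefix."""
    # Trailing '/' on the prefix prevents 'hashes-evil' matching 'hashes'.
    prefix = _normalize(hash_location).rstrip("/") + "/"
    return _normalize(storage_path).startswith(prefix)
```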
In `@src/ui/src/app/proxy/dataset/file/route.impl.production.ts`:
- Around line 72-80: The legacy URL fallback in route.impl.production.ts (the
searchParams.get("url") branch that sets the url variable and returns it) is an
SSRF risk; either remove this fallback entirely or tighten validation: parse the
provided URL, require http(s), then enforce a strict allowlist of known storage
hostnames/domains (and optionally specific ports) and reject anything else with
a 400; if you must keep broader support also perform DNS/IP resolution and block
private/loopback/metadata IP ranges (169.254.x.x, 127.x.x.x, 10/172.16/192.168,
::1, etc.) before returning { url } so only explicitly permitted storage hosts
can be fetched.
- Around line 35-44: The proxy currently drops range semantics: update the
request-forwarding logic to include the incoming "range" header when building
the upstream request (read request.headers.get("range") and set it on the
fetch/init headers) and extend the response header passthrough so
forwardResponseHeaders (and the equivalent block at 87-100) also returns
"accept-ranges" and "content-range" (add those names to FORWARDED_HEADERS or
explicitly copy them) so 206/Range responses are preserved through the proxy.
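Sketched in Python for illustration (the actual route is TypeScript); only `accept-ranges` and `content-range` come from the review finding, the rest of the forwarding set is assumed:

```python
# Response headers the proxy passes through; the two range-related names are
# the additions that keep 206 Partial Content semantics intact.
FORWARDED_RESPONSE_HEADERS = {
    "content-type", "content-length", "etag", "last-modified",
    "accept-ranges", "content-range",
}

def upstream_request_headers(incoming: dict) -> dict:
    # Forward the client's Range header so byte-range requests reach storage.
    return {k: v for k, v in incoming.items() if k.lower() == "range"}

def forward_response_headers(upstream: dict) -> dict:
    return {k: v for k, v in upstream.items()
            if k.lower() in FORWARDED_RESPONSE_HEADERS}
```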
In `@src/ui/src/lib/api/adapter/datasets.ts`:
- Around line 510-519: fetchDatasetFiles currently imports the server-only
module "@/lib/api/server/dataset-actions.production" (and its fetchManifest)
which breaks when called from the client via useDatasetFiles; instead, replace
that server import with a client-side HTTP call to the manifest API endpoint
(e.g. fetch(`/api/bucket/${bucket}/dataset/${name}/version/${version}/manifest`)
or your equivalent route), await and parse the JSON into RawFileItem[] and then
pass it to processManifestItems; keep the early-return for null location and
ensure the fetch includes any required credentials/headers for the client
request.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2896f6e1-d95c-4993-a21e-0aa53f0ca5b3
📒 Files selected for processing (15)
- src/lib/data/dataset/migrating.py
- src/lib/data/dataset/uploading.py
- src/lib/data/storage/backends/backends.py
- src/lib/data/storage/backends/common.py
- src/lib/data/storage/backends/tests/test_backends.py
- src/service/core/data/data_service.py
- src/ui/next.config.ts
- src/ui/src/app/proxy/dataset/file/route.impl.production.ts
- src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx
- src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
- src/ui/src/lib/api/adapter/datasets-hooks.ts
- src/ui/src/lib/api/adapter/datasets.ts
- src/ui/src/lib/api/server/dataset-actions.production.ts
- src/ui/src/lib/api/server/dataset-actions.ts
- src/ui/src/mocks/handlers.ts
💤 Files with no reviewable changes (2)
- src/ui/src/lib/api/server/dataset-actions.ts
- src/ui/next.config.ts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/service/core/data/data_service.py (1)
1265-1273: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent `parse_uri_to_link` call missing override parameters.

The `query_dataset` function's metadata-enabled path calls `parse_uri_to_link` with only the `region` parameter, unlike `get_collection_info` (line 134) and `get_dataset_info` (line 212), which now pass `override_url` and `addressing_style`. This would produce incorrect URLs for private S3-compatible buckets when querying datasets.

Proposed fix for consistency:

```diff
+ default_cred = bucket_config.default_credential
+ override_url = default_cred.override_url if default_cred else None
+ addressing_style = default_cred.addressing_style if default_cred else None
+
  if query_term.metadata_enabled:
      for row in dataset_rows:
          dataset_infos.append(objects.DataInfoDatasetEntry(
              ...
              location=storage.construct_storage_backend(row.location)\
-                 .parse_uri_to_link(bucket_config.region),
+                 .parse_uri_to_link(
+                     bucket_config.region,
+                     override_url=override_url,
+                     addressing_style=addressing_style,
+                 ),
              ...
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/service/core/data/data_service.py` around lines 1265 - 1273, The metadata-enabled branch in query_dataset constructs the storage backend link using parse_uri_to_link with only bucket_config.region; update this call to pass the same override parameters as get_collection_info and get_dataset_info by including bucket_config.override_url and bucket_config.addressing_style when calling storage.construct_storage_backend(...).parse_uri_to_link so private S3-compatible buckets build correct URLs; locate the call in query_dataset and mirror the parameter order/usage from get_collection_info/get_dataset_info.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/ui/src/app/proxy/dataset/file/route.impl.production.ts`:
- Around line 36-44: The proxy currently forwards upstream cache directives by
including "cache-control" in FORWARDED_RESPONSE_HEADERS; change the proxy logic
so it no longer forwards the upstream Cache-Control (remove "cache-control" from
FORWARDED_RESPONSE_HEADERS) and instead sets the response header to
"Cache-Control: private, no-store" for this authenticated route (and any other
places using the same forwarding list). Also ensure the Vary header is not
propagated from upstream (drop or overwrite "vary") so caching cannot vary by
upstream values; update the response handling code that uses
FORWARDED_RESPONSE_HEADERS to explicitly set these headers for auth-dependent
responses (reference FORWARDED_RESPONSE_HEADERS and the response-forwarding
logic in this route implementation).
In `@src/ui/src/features/datasets/detail/components/file-preview-panel.tsx`:
- Around line 349-358: The query that fetches HEAD results (useQuery with
queryKey ["file-preview-head", proxyUrl] and queryFn -> fetchHeadResult)
currently sets staleTime: Infinity which caches auth-sensitive 401/403 responses
forever; change staleTime to 0 so results are revalidated against current auth
headers (keep enabled: !!previewSource and the same
queryKey/previewSource/fetchHeadResult logic intact).
---
Outside diff comments:
In `@src/service/core/data/data_service.py`:
- Around line 1265-1273: The metadata-enabled branch in query_dataset constructs
the storage backend link using parse_uri_to_link with only bucket_config.region;
update this call to pass the same override parameters as get_collection_info and
get_dataset_info by including bucket_config.override_url and
bucket_config.addressing_style when calling
storage.construct_storage_backend(...).parse_uri_to_link so private
S3-compatible buckets build correct URLs; locate the call in query_dataset and
mirror the parameter order/usage from get_collection_info/get_dataset_info.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: af547cc3-7f82-4d7c-801b-cb373443e8c1
📒 Files selected for processing (5)
- src/lib/data/storage/backends/backends.py
- src/lib/data/storage/backends/tests/test_backends.py
- src/service/core/data/data_service.py
- src/ui/src/app/proxy/dataset/file/route.impl.production.ts
- src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
…le backends
S3Backend.parse_uri_to_link hardcoded the AWS pattern
'https://<bucket>.s3.<region>.amazonaws.com/...' regardless of credential
override_url. The UI consumes these URLs to fetch dataset content; for a
CAIOS-backed bucket this produced
'osmo-on-cw-dev-harnholm-datasets.s3.us-east-14a.amazonaws.com', which
ENOTFOUNDs in DNS.
Thread the credential's override_url and addressing_style through
parse_uri_to_link. When override_url is set, S3Backend builds a URL
against that host (virtual-host by default; addressing_style='path'
yields path-style for localstack/MinIO without wildcard DNS). AWS S3
behavior (no override_url) is unchanged. Other backends accept the
new kwargs but ignore them.
Updated callsites:
- data_service.py: dataset listing payload now respects bucket
config's default_credential.override_url. This is the immediate
UI fix.
- dataset/uploading.py (×2) + migrating.py: new uploads/migrations
record the correct URL in the manifest. Existing manifest entries
persisted with AWS-pattern URLs are not migrated (separate concern).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t private buckets (#793)

The UI's fetchManifest and file preview proxy performed unsigned fetch() against S3 HTTPS URLs, which fails with 403 on private buckets. Added two service-side proxy endpoints:
- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file content with storage_path validation against dataset container

Updated UI to call these endpoints through the existing /api catch-all proxy instead of direct unsigned fetch. Removed unused fetchManifest server action files.
The file preview panel sends a HEAD request before GET to check content-type and access. FastAPI's @router.get does not handle HEAD, returning 405 Method Not Allowed. Changed to @router.api_route with both GET and HEAD methods.
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.
Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
…oints accepted no positional args. On main, the public overloads now require a remote_path argument. Single-object fetches should go through storage.SingleObjectClient (already used by cli/dataset.py and lib/data/dataset/downloading.py) — its get_object_stream() takes no path because the path is bound at create() time from storage_uri. mypy on data_service.py now passes; behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
The server-action call to /api/bucket/{bucket}/dataset/{name}/manifest
went out without the incoming request's session cookie, so oauth2-proxy
returned 401 with "No valid authentication in request" and the dataset
detail page crashed with "Failed to fetch manifest: 401".
getServerFetchHeaders() already exists in config.production.ts for
exactly this purpose — it pulls authorization + cookie from the SSR
request via next/headers and forwards them upstream. Use it.
The original docstring claimed the server action would "bypass the API
gateway auth layer" by hitting an internal service URL, but
getServerApiBaseUrl() returns the public hostname (Envoy + oauth2-proxy
front), so requests still pass through the gateway. Forwarding the
caller's session is the simpler fix and doesn't require a new internal
URL knob in deployment config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
…uessing

storage_path is hash-keyed (e.g. .../hashes/<etag>) and carries no extension. mimetypes.guess_type(storage_path) returned None, the service fell back to application/octet-stream, and the preview panel rendered every text file as a binary download.

Forward the original relative_path through the proxy as a 'filename' query param. The service uses it for media-type guessing only; access control still hinges on storage_path's container check. Falls through cleanly when filename is absent (preserves current behavior for any legacy caller).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
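The guessing order described in that commit can be sketched as follows; `resolve_media_type` is an illustrative helper, not the service's actual function:

```python
import mimetypes
from typing import Optional

def resolve_media_type(storage_path: str, filename: Optional[str] = None) -> str:
    """Prefer the human-readable filename for guessing; fall back to storage_path."""
    # Hash-keyed storage paths (e.g. .../hashes/<etag>) carry no extension,
    # so guessing from them alone always ends in application/octet-stream.
    for candidate in (filename, storage_path):
        if candidate:
            guessed, _ = mimetypes.guess_type(candidate)
            if guessed:
                return guessed
    return "application/octet-stream"
```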
The conflict resolution when cherry-picking #795 left a multi-line generateFlatManifest call that prettier wants on one line under the project's print-width settings. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Four review-driven fixes that all touched the dataset preview path:

1. parse_uri_to_link preserves override_url base path (CAIOS via reverse proxy). Previously dropped parsed.path; an override of 'https://gateway.example.com/s3' produced URLs against gateway.example.com with no /s3 segment. Now host + base_path are preserved through both addressing styles, including the no-scheme-prefix case where urlparse drops the whole input into 'path'.
2. get_file_content validates against dataset hash_location prefix, not just the bucket container. This was the unresolved CodeRabbit feedback on #795 — container-only matching let an authenticated caller request any object in the same bucket via the dataset endpoint.
3. Forward Range request and accept-ranges/content-range response headers through the proxy so 206 Partial Content semantics survive (needed for video/large-file preview).
4. Remove the legacy 'url=' fallback from the proxy. It accepted any http(s) URL — a clear SSRF vector (cloud metadata endpoints, RFC1918, loopback). Manifests since #795 always carry storage_path; the legacy path was effectively dead. UI's file-preview-panel updated to drop the matching client-side fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
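Fix 4 removes the fallback outright; for reference, the kind of validation the review asked for if such a fallback were kept could look like this (the allowlist contents and helper names are illustrative, not project code):

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of known storage hosts; a real deployment would
# source this from configuration.
ALLOWED_STORAGE_HOSTS = {"storage.example.com", "gateway.example.com"}

def is_private_address(host: str) -> bool:
    # Block loopback, RFC1918, link-local (incl. 169.254.169.254 metadata), etc.
    try:
        return not ipaddress.ip_address(host).is_global
    except ValueError:
        return False  # not an IP literal; the allowlist decides

def is_allowed_storage_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or parsed.hostname is None:
        return False
    if is_private_address(parsed.hostname):
        return False
    return parsed.hostname in ALLOWED_STORAGE_HOSTS
```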
Two review fixes for the auth-sensitive dataset preview path:
1. Stop forwarding upstream Cache-Control on /proxy/dataset/file. Storage
providers commonly set 'public, max-age=…' on object responses; passing
that through on a per-user authenticated route lets intermediate caches
serve one user's bytes to another. Drop 'cache-control' from
FORWARDED_RESPONSE_HEADERS and explicitly set 'Cache-Control:
private, no-store' on every response. ('vary' was never forwarded — left
as is.)
2. Drop staleTime: Infinity from the file-preview HEAD useQuery. Caching a
401/403 forever would prevent revalidation after the user's session
refreshes. staleTime: 0 restores standard React Query revalidation
semantics (refetch on focus/mount when stale).
The text-preview useQuery one block up still has staleTime: Infinity; left
alone here since the reviewer scoped the finding to the HEAD query, but
worth a follow-up if we see stuck 401s on the text preview specifically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Force-pushed 2d13ce3 to b9e76a1
…set link
query_dataset's metadata-enabled branch was the third parse_uri_to_link
callsite I missed when threading override_url + addressing_style for
private S3-compatible buckets — get_collection_info and get_dataset_info
already do this. Without it, /api/bucket/{bucket}/query results
on a CAIOS-backed bucket return AWS-pattern hostnames that ENOTFOUND on
the UI side, same failure mode the original parse_uri_to_link bug had.
Mirrors the default_credential fallback pattern (override_url and
addressing_style live on bucket_config.default_credential, not directly
on bucket_config).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
🧹 Nitpick comments (1)
src/service/core/data/data_service.py (1)
1009-1035: 💤 Low value. Consider adding a more specific return type annotation.

The return type `-> List` is quite generic. For better type safety and API documentation, consider specifying the manifest entry structure:

```diff
-) -> List:
+) -> List[Dict[str, Any]]:
```

Or define a Pydantic model for manifest entries if the structure is well-defined.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/service/core/data/data_service.py` around lines 1009 - 1035, the function get_manifest currently uses a generic return annotation -> List; change it to a specific type (e.g., List[ManifestEntry] or List[Dict[str, Any]]) to improve type safety and docs: define a Pydantic model ManifestEntry (subclass pydantic.BaseModel) capturing the manifest fields or choose typing.List[Dict[str, Any]] if structure is dynamic, update the function signature from def get_manifest(...) -> List to def get_manifest(...) -> List[ManifestEntry] (or List[Dict[str, Any]]), and replace the raw json.loads(manifest_content) return with parsing to the chosen type (e.g., pydantic.parse_obj_as(List[ManifestEntry], json.loads(...)) or cast to List[Dict[str, Any]]); add the necessary imports for typing or pydantic and keep existing logic in get_manifest unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/service/core/data/data_service.py`:
- Around line 1009-1035: The function get_manifest currently uses a generic
return annotation -> List; change it to a specific type (e.g.,
List[ManifestEntry] or List[Dict[str, Any]]) to improve type safety and docs:
define a Pydantic model ManifestEntry (subclass pydantic.BaseModel) capturing
the manifest fields or choose typing.List[Dict[str, Any]] if structure is
dynamic, update the function signature from def get_manifest(...) -> List to def
get_manifest(...) -> List[ManifestEntry] (or List[Dict[str, Any]]), and replace
the raw json.loads(manifest_content) return with parsing to the chosen type
(e.g., pydantic.parse_obj_as(List[ManifestEntry], json.loads(...)) or cast to
List[Dict[str, Any]]); add the necessary imports for typing or pydantic and keep
existing logic in get_manifest unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a0f768d9-9261-4c0e-8ddf-e321780c6e62
📒 Files selected for processing (1)
src/service/core/data/data_service.py
Description
Completes the CAIOS / private-S3-compatible-bucket unblock work started in #950, focused on the UI side. After #950, the CLI and workflow runtime worked against CAIOS, but the dataset detail page in the UI was still unusable. This PR fixes the three distinct failures that surfaced — URL construction, gateway auth on SSR fetches, and the unsigned-fetch-against-private-bucket pattern that #795 was already addressing.
Issue #950
Checklist
Summary by CodeRabbit
New Features
Refactor
Tests / Mocks
Chores