
fix: dataset: fix browsing datasets from private buckets#957

Merged
cypres merged 11 commits into main from harnholm/cw-storage-fixes on May 6, 2026

Conversation


@cypres cypres commented May 5, 2026

Description

Completes the CAIOS / private-S3-compatible-bucket unblock work started in #950, focused on the UI side. After #950, the CLI and workflow runtime worked against CAIOS, but the dataset detail page in the UI was still unusable. This PR fixes the three distinct failures that surfaced — URL construction, gateway auth on SSR fetches, and the unsigned-fetch-against-private-bucket pattern that #795 was already addressing.

Issue #950

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added dataset manifest endpoint and file-content streaming with MIME detection.
    • Production dataset file proxy route supporting bucket/name/storagePath and filename passthrough; GET/HEAD proxying with access protections.
  • Refactor

    • Storage link construction now respects per-bucket data credentials, override hosts, and addressing styles.
    • File preview and dataset listing now operate with explicit bucket/name/version and proxy-aware preview flow.
  • Tests / Mocks

    • Added S3 URL-construction tests and a mock manifest endpoint for private buckets.
  • Chores

    • Removed legacy direct-manifest fetch path.

@cypres cypres requested a review from a team as a code owner May 5, 2026 22:59

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Extends storage backend link generation to accept per-bucket credential overrides (override_url, addressing_style), threads those into dataset upload/migration and service-layer URL construction, adds manifest and file-content endpoints, and updates frontend proxy, preview UI, hooks, tests, and mocks to use credential-aware proxied storage URLs.

Changes

Credential-scoped link building → service-proxied manifest & preview

  • Storage Backend Interface — src/lib/data/storage/backends/common.py
    parse_uri_to_link signature extended with keyword-only override_url and addressing_style parameters in addition to region.
  • Backend Implementations — src/lib/data/storage/backends/backends.py
    Swift, S3, GS, TOS, Azure backends updated to the new signature; S3 implements override_url handling (parses override host/base-path and builds path-style or virtual-hosted URLs based on addressing_style).
  • Backend Tests — src/lib/data/storage/backends/tests/test_backends.py
    Added S3BackendParseUriToLinkTest (six tests) validating S3 parse_uri_to_link behavior for the default AWS pattern, override virtual-host, path-style addressing, scheme preservation, and base-path handling.
  • Dataset Workers / Uploads — src/lib/data/dataset/migrating.py, src/lib/data/dataset/uploading.py
    Workers now resolve destination.resolved_data_credential and pass override_url/addressing_style into parse_uri_to_link when composing manifest entry URLs (replaces region-only calls).
  • Service Layer — src/service/core/data/data_service.py
    get_collection_info/get_dataset_info now derive credential override fields from bucket_config.default_credential and pass them to parse_uri_to_link. New endpoints: GET /{bucket}/dataset/{name}/manifest (returns manifest JSON) and GET /{bucket}/dataset/{name}/file-content (streams object bytes with MIME detection).
  • UI Proxy Route (production) — src/ui/src/app/proxy/dataset/file/route.impl.production.ts
    Replaced the URL-based proxy with NextRequest handlers requiring bucket, name, storagePath (optional filename), forwarding selected headers and routing to the backend /file-content endpoint to avoid SSRF and centralize upstream fetch logic.
  • Dataset Detail & Preview UI — src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx, src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
    DatasetDetailContent exposes resolvedName/resolvedVersion. FilePreviewPanel gains bucket and datasetName props, switches to storage-path proxy URLs (toProxyUrl(bucket, datasetName, file)), updates the head preflight (fetchHeadResult) to use bucket/dataset-scoped proxying, and changes preview state to carry proxyUrl.
  • Client-side API / Hooks — src/ui/src/lib/api/adapter/datasets.ts, src/ui/src/lib/api/adapter/datasets-hooks.ts, src/ui/src/lib/api/server/dataset-actions.production.ts
    fetchDatasetFiles and useDatasetFiles signatures changed to accept bucket, name, version and call the internal manifest API. Production fetchManifest refactored to accept bucket/name/version and perform a server-side fetch with forwarded headers; the old mixed mock/production module was removed.
  • Mocks & Build Alias — src/ui/src/mocks/handlers.ts, src/ui/next.config.ts
    Added MSW handler GET /api/bucket/:bucket/dataset/:name/manifest for private-bucket manifests. Updated the Turbopack production alias to point at the app API route implementation and removed the old dataset-actions production alias.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Browser as User Browser (UI)
  participant Preview as FilePreviewPanel
  participant Frontend as Next App Proxy
  participant Service as Data Service
  participant Storage as Storage Backend

  Browser->>Preview: open preview (bucket, datasetName, file)
  Preview->>Frontend: HEAD/GET /app/proxy?bucket&name&storagePath
  Frontend->>Service: GET /{bucket}/dataset/{name}/file-content (forwards auth)
  Service->>Storage: storage client reads object using resolved credential (override_url, addressing_style)
  Storage-->>Service: object bytes + metadata
  Service-->>Frontend: proxied response with content headers
  Frontend-->>Preview: proxied response
  Preview-->>Browser: render preview using proxy URL

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

"A rabbit nibbles code by lamplight,
threads credentials into every byte.
Buckets whisper paths, proxies hum,
manifests sing and previews come. 🐇✨"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 65.22%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title clearly summarizes the main change — fixing dataset browsing for private buckets — which aligns with the changeset's URL-construction, authentication, and proxy-routing fixes for private S3-compatible storage backends.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@cypres cypres changed the title Harnholm/cw storage fixes fix: dataset: fix browsing datasets from private buckets May 5, 2026
@cypres cypres temporarily deployed to internal-ci May 5, 2026 23:03 — with GitHub Actions Inactive

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 29.68750% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.93%. Comparing base (c87cc34) to head (7f99f57).

Files with missing lines Patch % Lines
src/service/core/data/data_service.py 14.28% 30 Missing ⚠️
...i/src/lib/api/server/dataset-actions.production.ts 0.00% 5 Missing ⚠️
src/lib/data/storage/backends/backends.py 76.47% 3 Missing and 1 partial ⚠️
src/ui/src/lib/api/adapter/datasets.ts 0.00% 3 Missing ⚠️
src/lib/data/dataset/migrating.py 0.00% 1 Missing ⚠️
src/lib/data/dataset/uploading.py 0.00% 1 Missing ⚠️
src/ui/src/lib/api/adapter/datasets-hooks.ts 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #957   +/-   ##
=======================================
  Coverage   42.93%   42.93%           
=======================================
  Files         218      217    -1     
  Lines       28458    28489   +31     
  Branches     4255     4256    +1     
=======================================
+ Hits        12218    12232   +14     
- Misses      15599    15615   +16     
- Partials      641      642    +1     
Flag Coverage Δ
backend 43.68% <34.54%> (-0.03%) ⬇️
ui 34.60% <0.00%> (+0.26%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/lib/data/storage/backends/common.py 84.44% <100.00%> (ø)
src/lib/data/dataset/migrating.py 41.53% <0.00%> (-0.65%) ⬇️
src/lib/data/dataset/uploading.py 27.27% <0.00%> (-0.26%) ⬇️
src/ui/src/lib/api/adapter/datasets-hooks.ts 0.00% <0.00%> (ø)
src/ui/src/lib/api/adapter/datasets.ts 26.66% <0.00%> (ø)
src/lib/data/storage/backends/backends.py 57.60% <76.47%> (+0.53%) ⬆️
...i/src/lib/api/server/dataset-actions.production.ts 0.00% <0.00%> (ø)
src/service/core/data/data_service.py 10.90% <14.28%> (+0.23%) ⬆️


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 481-487: The URL-building branch that handles override_url
currently drops any path prefix (parsed.path) and thus breaks reverse-proxied S3
endpoints; update the logic that builds the return strings (both the path-style
and virtual-host-style branches using addressing_style, scheme, host,
self.container, self.path) to include parsed.path.rstrip('/') as a base path
prefix (ensure you insert a single '/' between the base path and subsequent
segments and still call .rstrip('/') on the final URL) so that an override like
"https://gateway.example.com/s3" yields
"https://gateway.example.com/s3/<container>/<path>" (path-style) or
"https://<container>.gateway.example.com/s3/<path>" (virtual-host-style) as
appropriate.

In `@src/service/core/data/data_service.py`:
- Around line 1051-1056: The current check only compares
requested_backend.container to dataset_backend.container which allows any object
in the same container; update validation in the code that calls
storage.construct_storage_backend (where requested_backend, dataset_backend,
storage_path and dataset_info.hash_location are used) to also verify that the
requested storage_path begins with the dataset_info.hash_location prefix (or its
normalized equivalent) before allowing access, and raise the same
osmo_errors.OSMOUserError if the path is outside that prefix; ensure you
normalize/strip leading slashes or URL schemes consistently when comparing
prefixes so comparisons are robust across inputs.
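A sketch of the prefix check this comment asks for, assuming nothing about the service's actual helpers — the name is_path_within_dataset and its normalization details are illustrative:

```python
import posixpath

def is_path_within_dataset(storage_path: str, hash_location: str) -> bool:
    """Check that a requested object path stays inside the dataset's
    hash_location prefix, not merely the same bucket/container.

    Normalizes schemes, leading slashes, and '..' segments so the
    prefix comparison is robust across input forms.
    """
    def normalize(p: str) -> str:
        # Strip any scheme prefix (e.g. 's3://bucket/') down to the key.
        if "://" in p:
            p = p.split("://", 1)[1].split("/", 1)[-1]
        return posixpath.normpath("/" + p).lstrip("/")

    requested = normalize(storage_path)
    prefix = normalize(hash_location)
    # Require a segment boundary so 'a/bc' does not match prefix 'a/b'.
    return requested == prefix or requested.startswith(prefix + "/")
```

A caller would raise the same osmo_errors.OSMOUserError when this returns False, rather than allowing any object in the shared container.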

In `@src/ui/src/app/proxy/dataset/file/route.impl.production.ts`:
- Around line 72-80: The legacy URL fallback in route.impl.production.ts (the
searchParams.get("url") branch that sets the url variable and returns it) is an
SSRF risk; either remove this fallback entirely or tighten validation: parse the
provided URL, require http(s), then enforce a strict allowlist of known storage
hostnames/domains (and optionally specific ports) and reject anything else with
a 400; if you must keep broader support also perform DNS/IP resolution and block
private/loopback/metadata IP ranges (169.254.x.x, 127.x.x.x, 10/172.16/192.168,
::1, etc.) before returning { url } so only explicitly permitted storage hosts
can be fetched.
- Around line 35-44: The proxy currently drops range semantics: update the
request-forwarding logic to include the incoming "range" header when building
the upstream request (read request.headers.get("range") and set it on the
fetch/init headers) and extend the response header passthrough so
forwardResponseHeaders (and the equivalent block at 87-100) also returns
"accept-ranges" and "content-range" (add those names to FORWARDED_HEADERS or
explicitly copy them) so 206/Range responses are preserved through the proxy.
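The range passthrough the comment describes is language-agnostic; here is a minimal Python sketch (header names come from the comment, the function name is hypothetical):

```python
# Request headers forwarded upstream from the incoming request.
FORWARDED_REQUEST_HEADERS = ("range",)
# Response headers copied back from upstream; accept-ranges and
# content-range must be included or 206 Partial Content breaks.
FORWARDED_RESPONSE_HEADERS = ("content-type", "content-length",
                              "accept-ranges", "content-range")

def proxy_headers(incoming: dict, upstream: dict) -> tuple[dict, dict]:
    """Split header passthrough for a byte-range-aware proxy.

    Returns (headers to send upstream, headers to return downstream),
    dropping everything not on the explicit allowlists.
    """
    up = {k: incoming[k] for k in FORWARDED_REQUEST_HEADERS if k in incoming}
    down = {k: upstream[k] for k in FORWARDED_RESPONSE_HEADERS if k in upstream}
    return up, down
```

Preserving these headers is what lets video and large-file previews seek without downloading the whole object through the proxy.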

In `@src/ui/src/lib/api/adapter/datasets.ts`:
- Around line 510-519: fetchDatasetFiles currently imports the server-only
module "@/lib/api/server/dataset-actions.production" (and its fetchManifest)
which breaks when called from the client via useDatasetFiles; instead, replace
that server import with a client-side HTTP call to the manifest API endpoint
(e.g. fetch(`/api/bucket/${bucket}/dataset/${name}/version/${version}/manifest`)
or your equivalent route), await and parse the JSON into RawFileItem[] and then
pass it to processManifestItems; keep the early-return for null location and
ensure the fetch includes any required credentials/headers for the client
request.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2896f6e1-d95c-4993-a21e-0aa53f0ca5b3

📥 Commits

Reviewing files that changed from the base of the PR and between 1962707 and f21e72e.

📒 Files selected for processing (15)
  • src/lib/data/dataset/migrating.py
  • src/lib/data/dataset/uploading.py
  • src/lib/data/storage/backends/backends.py
  • src/lib/data/storage/backends/common.py
  • src/lib/data/storage/backends/tests/test_backends.py
  • src/service/core/data/data_service.py
  • src/ui/next.config.ts
  • src/ui/src/app/proxy/dataset/file/route.impl.production.ts
  • src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx
  • src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
  • src/ui/src/lib/api/adapter/datasets-hooks.ts
  • src/ui/src/lib/api/adapter/datasets.ts
  • src/ui/src/lib/api/server/dataset-actions.production.ts
  • src/ui/src/lib/api/server/dataset-actions.ts
  • src/ui/src/mocks/handlers.ts
💤 Files with no reviewable changes (2)
  • src/ui/src/lib/api/server/dataset-actions.ts
  • src/ui/next.config.ts

Comment threads:
  • src/lib/data/storage/backends/backends.py (outdated)
  • src/service/core/data/data_service.py (outdated)
  • src/ui/src/app/proxy/dataset/file/route.impl.production.ts (outdated, two threads)
  • src/ui/src/lib/api/adapter/datasets.ts
@cypres cypres temporarily deployed to internal-ci May 5, 2026 23:27 — with GitHub Actions Inactive

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/service/core/data/data_service.py (1)

1265-1273: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent parse_uri_to_link call missing override parameters.

The query_dataset function's metadata-enabled path calls parse_uri_to_link with only the region parameter, unlike get_collection_info (line 134) and get_dataset_info (line 212) which now pass override_url and addressing_style. This would produce incorrect URLs for private S3-compatible buckets when querying datasets.

Proposed fix for consistency
+    default_cred = bucket_config.default_credential
+    override_url = default_cred.override_url if default_cred else None
+    addressing_style = default_cred.addressing_style if default_cred else None
+
     if query_term.metadata_enabled:
         for row in dataset_rows:
             dataset_infos.append(objects.DataInfoDatasetEntry(
                 ...
                 location=storage.construct_storage_backend(row.location)\
-                    .parse_uri_to_link(bucket_config.region),
+                    .parse_uri_to_link(
+                        bucket_config.region,
+                        override_url=override_url,
+                        addressing_style=addressing_style,
+                    ),
                 ...
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/ui/src/app/proxy/dataset/file/route.impl.production.ts`:
- Around line 36-44: The proxy currently forwards upstream cache directives by
including "cache-control" in FORWARDED_RESPONSE_HEADERS; change the proxy logic
so it no longer forwards the upstream Cache-Control (remove "cache-control" from
FORWARDED_RESPONSE_HEADERS) and instead sets the response header to
"Cache-Control: private, no-store" for this authenticated route (and any other
places using the same forwarding list). Also ensure the Vary header is not
propagated from upstream (drop or overwrite "vary") so caching cannot vary by
upstream values; update the response handling code that uses
FORWARDED_RESPONSE_HEADERS to explicitly set these headers for auth-dependent
responses (reference FORWARDED_RESPONSE_HEADERS and the response-forwarding
logic in this route implementation).

In `@src/ui/src/features/datasets/detail/components/file-preview-panel.tsx`:
- Around line 349-358: The query that fetches HEAD results (useQuery with
queryKey ["file-preview-head", proxyUrl] and queryFn -> fetchHeadResult)
currently sets staleTime: Infinity which caches auth-sensitive 401/403 responses
forever; change staleTime to 0 so results are revalidated against current auth
headers (keep enabled: !!previewSource and the same
queryKey/previewSource/fetchHeadResult logic intact).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: af547cc3-7f82-4d7c-801b-cb373443e8c1

📥 Commits

Reviewing files that changed from the base of the PR and between 9c62d8a and 2d13ce3.

📒 Files selected for processing (5)
  • src/lib/data/storage/backends/backends.py
  • src/lib/data/storage/backends/tests/test_backends.py
  • src/service/core/data/data_service.py
  • src/ui/src/app/proxy/dataset/file/route.impl.production.ts
  • src/ui/src/features/datasets/detail/components/file-preview-panel.tsx

Comment thread src/ui/src/app/proxy/dataset/file/route.impl.production.ts
cypres and others added 10 commits May 5, 2026 16:41
…le backends

S3Backend.parse_uri_to_link hardcoded the AWS pattern
'https://<bucket>.s3.<region>.amazonaws.com/...' regardless of credential
override_url. The UI consumes these URLs to fetch dataset content; for a
CAIOS-backed bucket this produced
'osmo-on-cw-dev-harnholm-datasets.s3.us-east-14a.amazonaws.com', which
ENOTFOUNDs in DNS.

Thread the credential's override_url and addressing_style through
parse_uri_to_link. When override_url is set, S3Backend builds a URL
against that host (virtual-host by default; addressing_style='path'
yields path-style for localstack/MinIO without wildcard DNS). AWS S3
behavior (no override_url) is unchanged. Other backends accept the
new kwargs but ignore them.

Updated callsites:
  - data_service.py: dataset listing payload now respects bucket
    config's default_credential.override_url. This is the immediate
    UI fix.
  - dataset/uploading.py (×2) + migrating.py: new uploads/migrations
    record the correct URL in the manifest. Existing manifest entries
    persisted with AWS-pattern URLs are not migrated (separate concern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t private buckets (#793)

The UI's fetchManifest and file preview proxy performed unsigned fetch()
against S3 HTTPS URLs, which fails with 403 on private buckets.

Added two service-side proxy endpoints:
- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage
  using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file
  content with storage_path validation against dataset container

Updated UI to call these endpoints through the existing /api catch-all
proxy instead of direct unsigned fetch. Removed unused fetchManifest
server action files.
The file preview panel sends a HEAD request before GET to check
content-type and access. FastAPI's @router.get does not handle HEAD,
returning 405 Method Not Allowed. Changed to @router.api_route with
both GET and HEAD methods.
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.

Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
…oints

accepted no positional args. On main, the public overloads now require a
remote_path argument. Single-object fetches should go through
storage.SingleObjectClient (already used by cli/dataset.py and
lib/data/dataset/downloading.py) — its get_object_stream() takes no path
because the path is bound at create() time from storage_uri.

mypy on data_service.py now passes; behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
The server-action call to /api/bucket/{bucket}/dataset/{name}/manifest
went out without the incoming request's session cookie, so oauth2-proxy
returned 401 with "No valid authentication in request" and the dataset
detail page crashed with "Failed to fetch manifest: 401".

getServerFetchHeaders() already exists in config.production.ts for
exactly this purpose — it pulls authorization + cookie from the SSR
request via next/headers and forwards them upstream. Use it.

The original docstring claimed the server action would "bypass the API
gateway auth layer" by hitting an internal service URL, but
getServerApiBaseUrl() returns the public hostname (Envoy + oauth2-proxy
front), so requests still pass through the gateway. Forwarding the
caller's session is the simpler fix and doesn't require a new internal
URL knob in deployment config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
…uessing

storage_path is hash-keyed (e.g. .../hashes/<etag>) and carries no
extension. mimetypes.guess_type(storage_path) returned None, the service
fell back to application/octet-stream, and the preview panel rendered
every text file as a binary download.

Forward the original relative_path through the proxy as a 'filename'
query param. The service uses it for media-type guessing only; access
control still hinges on storage_path's container check. Falls through
cleanly when filename is absent (preserves current behavior for any
legacy caller).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
The conflict resolution when cherry-picking #795 left a multi-line
generateFlatManifest call that prettier wants on one line under the
project's print-width settings. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Four review-driven fixes that all touched the dataset preview path:

1. parse_uri_to_link preserves override_url base path (CAIOS via reverse
   proxy). Previously dropped parsed.path; an override of
   'https://gateway.example.com/s3' produced URLs against
   gateway.example.com with no /s3 segment. Now host + base_path are
   preserved through both addressing styles, including the
   no-scheme-prefix case where urlparse drops the whole input into 'path'.

2. get_file_content validates against dataset hash_location prefix, not
   just the bucket container. This was the unresolved CodeRabbit feedback
   on #795 — container-only matching let an authenticated caller request
   any object in the same bucket via the dataset endpoint.

3. Forward Range request and accept-ranges/content-range response headers
   through the proxy so 206 Partial Content semantics survive (needed for
   video/large-file preview).

4. Remove the legacy 'url=' fallback from the proxy. It accepted any
   http(s) URL — a clear SSRF vector (cloud metadata endpoints, RFC1918,
   loopback). Manifests since #795 always carry storage_path; the legacy
   path was effectively dead. UI's file-preview-panel updated to drop the
   matching client-side fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Two review fixes for the auth-sensitive dataset preview path:

1. Stop forwarding upstream Cache-Control on /proxy/dataset/file. Storage
   providers commonly set 'public, max-age=…' on object responses; passing
   that through on a per-user authenticated route lets intermediate caches
   serve one user's bytes to another. Drop 'cache-control' from
   FORWARDED_RESPONSE_HEADERS and explicitly set 'Cache-Control:
   private, no-store' on every response. ('vary' was never forwarded — left
   as is.)

2. Drop staleTime: Infinity from the file-preview HEAD useQuery. Caching a
   401/403 forever would prevent revalidation after the user's session
   refreshes. staleTime: 0 restores standard React Query revalidation
   semantics (refetch on focus/mount when stale).

The text-preview useQuery one block up still has staleTime: Infinity; left
alone here since the reviewer scoped the finding to the HEAD query, but
worth a follow-up if we see stuck 401s on the text preview specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
@cypres cypres force-pushed the harnholm/cw-storage-fixes branch from 2d13ce3 to b9e76a1 Compare May 5, 2026 23:50
@cypres cypres temporarily deployed to internal-ci May 5, 2026 23:51 — with GitHub Actions Inactive
…set link

query_dataset's metadata-enabled branch was the third parse_uri_to_link
callsite I missed when threading override_url + addressing_style for
private S3-compatible buckets — get_collection_info and get_dataset_info
already do this. Without it, /api/bucket/{bucket}/query results
on a CAIOS-backed bucket return AWS-pattern hostnames that ENOTFOUND on
the UI side, same failure mode the original parse_uri_to_link bug had.

Mirrors the default_credential fallback pattern (override_url and
addressing_style live on bucket_config.default_credential, not directly
on bucket_config).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
@cypres cypres temporarily deployed to internal-ci May 6, 2026 00:01 — with GitHub Actions Inactive

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/service/core/data/data_service.py (1)

1009-1035: 💤 Low value

Consider adding a more specific return type annotation.

The return type -> List is quite generic. For better type safety and API documentation, consider specifying the manifest entry structure:

-) -> List:
+) -> List[Dict[str, Any]]:

Or define a Pydantic model for manifest entries if the structure is well-defined.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/service/core/data/data_service.py` around lines 1009 - 1035, The function
get_manifest currently uses a generic return annotation -> List; change it to a
specific type (e.g., List[ManifestEntry] or List[Dict[str, Any]) to improve type
safety and docs: define a Pydantic model ManifestEntry (subclass
pydantic.BaseModel) capturing the manifest fields or choose
typing.List[Dict[str, Any]] if structure is dynamic, update the function
signature from def get_manifest(...) -> List to def get_manifest(...) ->
List[ManifestEntry] (or List[Dict[str, Any]]), and replace the raw
json.loads(manifest_content) return with parsing to the chosen type (e.g.,
pydantic.parse_obj_as(List[ManifestEntry], json.loads(...)) or cast to
List[Dict[str, Any]]); add the necessary imports for typing or pydantic and keep
existing logic in get_manifest unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a0f768d9-9261-4c0e-8ddf-e321780c6e62

📥 Commits

Reviewing files that changed from the base of the PR and between b9e76a1 and 7f99f57.

📒 Files selected for processing (1)
  • src/service/core/data/data_service.py

@cypres cypres merged commit 81ca71c into main May 6, 2026
12 checks passed
@cypres cypres deleted the harnholm/cw-storage-fixes branch May 6, 2026 17:22
4 participants