
Support DefaultDataCredential in Service (Workflow Data, Logs, Apps)#865

Merged
fernandol-nvidia merged 8 commits into main from KeitaW/feat/support-default-data-credential
Apr 16, 2026

Conversation

@fernandol-nvidia
Contributor

@fernandol-nvidia fernandol-nvidia commented Apr 15, 2026

Description

Ported changes from #751 for the upcoming release.

Issue - None

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes

    • AWS credential env vars no longer overwrite SDK defaults when empty.
    • STS assumed-role ARNs normalized to IAM role ARNs for correct policy simulation.
    • Bucket default-credential indicator now reflects actual presence.
    • Dataset list command omits empty name parameter.
  • New Features

    • Region can be set via credential information.
  • Tests

    • Added unit tests for credential serialization and handling across credential types.

@fernandol-nvidia fernandol-nvidia requested a review from a team as a code owner April 15, 2026 19:26
@coderabbitai

coderabbitai bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Widened credential abstractions to DataCredential across connectors and jobs, made MountURL avoid exporting empty AWS credentials and region, normalized STS assumed-role ARNs to IAM role ARNs for IAM policy simulation in S3 auth, and added unit tests for DefaultDataCredential serialization and config behavior.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Connector & job type changes: src/utils/connectors/postgres.py, src/utils/job/task.py, src/utils/job/workflow.py | Switched signatures and mappings to accept credentials.DataCredential/Mapping[...] instead of concrete StaticDataCredential; added _resolve_bucket_credential(...) to materialize bucket-default credentials; adjusted credential resolution and caching in workflow validation. |
| S3 auth & ARN normalization: src/lib/data/storage/backends/backends.py | Treat DefaultDataCredential as ambient (perform real bucket checks) and normalize STS assumed-role ARNs (:assumed-role/) into IAM role ARNs before calling simulate_principal_policy; unified session/client creation for other creds. |
| Runtime AWS env handling: src/runtime/pkg/data/data.go | MountURL sets AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY only when access key ID is non-empty and sets AWS_REGION only when region is non-empty to avoid blank overrides of ambient credentials. |
| Tests & test deps: src/lib/data/storage/backends/tests/test_backends.py, src/utils/job/tests/test_task.py, src/utils/job/tests/BUILD | Added unit tests validating DefaultDataCredential preservation and that to_decrypted_dict() hides access keys while preserving endpoint/region; updated test imports and added //src/lib/utils:credentials dependency. |
| CLI request params change: src/cli/dataset.py | _run_list_command builds request params conditionally: only include name, user, buckets when provided; always include all_users, count, and uppercased order. |
| Bucket info default_cred flag: src/service/core/data/data_service.py | get_bucket_info now sets default_cred based solely on presence of bucket_info.default_credential (no longer checks for non-empty access key fields). |
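
The conditional parameter building described for _run_list_command can be sketched roughly as below. This is a minimal illustration, not the code in src/cli/dataset.py: the function name `build_list_params`, its defaults, and the exact dictionary keys are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional


def build_list_params(
    name: Optional[str] = None,
    user: Optional[List[str]] = None,
    buckets: Optional[List[str]] = None,
    all_users: bool = False,
    count: int = 20,
    order: str = 'asc',
) -> Dict[str, Any]:
    """Build request params, omitting optional filters that were not provided.

    Hypothetical helper: names and defaults are illustrative only.
    """
    # These keys are always sent; order is uppercased.
    params: Dict[str, Any] = {
        'all_users': all_users,
        'count': count,
        'order': order.upper(),
    }
    # Optional filters are added only when they carry a value, so an empty
    # name is never sent to the service.
    if name:
        params['name'] = name
    if user:
        params['user'] = user
    if buckets:
        params['buckets'] = buckets
    return params
```

With this shape, `build_list_params(name='')` yields only the three always-present keys, which matches the "omits empty name parameter" behavior listed in the summary.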

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant S3 as S3 (head/list)
    participant STS as STS
    participant IAM as IAM (simulate_principal_policy)
    Client->>S3: data_auth with DefaultDataCredential -> perform head_bucket/list_buckets
    alt DefaultDataCredential handles ambient auth
        S3-->>Client: bucket access validated
    else Non-default credential
        Client->>STS: GetCallerIdentity (via session)
        STS-->>Client: caller_arn
        Note right of Client: if ARN contains :assumed-role/\nconvert to arn:aws:iam::<acct>:role/<role>
        Client->>IAM: simulate_principal_policy(PolicySourceArn=normalized_arn,...)
        IAM-->>Client: simulation result
        Client->>S3: proceed based on simulation result
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I sniff the creds, a careful hop,

I skip the blanks and only stop to cop
the region kept and keys tucked tight,
ARNs trimmed neat before they take flight.
Tests hum softly — tidy, spry, and right.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective of the pull request, which is to support DefaultDataCredential across service layers (Workflow Data, Logs, and Apps). |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/runtime/pkg/data/data.go`:
- Around line 421-427: The current mount credential handling in data.go (using
dataCredential) only sets AWS env vars when AccessKeyId/Region are non-empty but
never clears them, and AWS_SECRET_ACCESS_KEY is gated by AccessKeyId which can
leave stale credentials in-process; update the logic in the mount setup that
reads dataCredential so that for each field you either set the corresponding env
var when non-empty or explicitly unset it when empty (at minimum clear
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN and AWS_REGION per
mount), and fix the gating so AWS_SECRET_ACCESS_KEY is checked/cleared based on
dataCredential.AccessKey (or AccessKeyId as appropriate) rather than only
AccessKeyId to avoid carrying over stale static credentials between mounts.
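
The set-or-unset behavior this comment asks for can be sketched as follows. The real logic lives in Go (src/runtime/pkg/data/data.go); this Python rendering only illustrates the idea, and the function name `apply_mount_credential_env` and its parameters are hypothetical. It takes the environment as a mapping so it can be exercised without touching the real process environment.

```python
from typing import MutableMapping


def apply_mount_credential_env(
    env: MutableMapping[str, str],
    access_key_id: str = '',
    access_key: str = '',
    region: str = '',
) -> None:
    """Set each AWS variable when its value is non-empty, otherwise unset it.

    Hypothetical sketch of the reviewer's suggestion, not the merged code.
    """
    desired = {
        'AWS_ACCESS_KEY_ID': access_key_id,
        # Gated on the secret itself rather than on the access key ID.
        'AWS_SECRET_ACCESS_KEY': access_key,
        'AWS_REGION': region,
        # Assumption: static credentials in this sketch carry no session
        # token, so any leftover token from a previous mount is cleared.
        'AWS_SESSION_TOKEN': '',
    }
    for name, value in desired.items():
        if value:
            env[name] = value
        else:
            # Remove stale values left behind by an earlier mount.
            env.pop(name, None)
```

Passing `os.environ` as `env` would apply this to the current process. One trade-off to weigh: unsetting these variables also removes ambient credentials that were supplied through the same variables, so this approach presumably only suits deployments where ambient credentials come from another source.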

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5c8a5954-c34e-4ef7-932e-1e828ad1519a

📥 Commits

Reviewing files that changed from the base of the PR and between 8fde840 and bc8f0f2.

📒 Files selected for processing (6)
  • src/lib/data/storage/backends/tests/test_backends.py
  • src/runtime/pkg/data/data.go
  • src/utils/connectors/postgres.py
  • src/utils/job/task.py
  • src/utils/job/tests/BUILD
  • src/utils/job/tests/test_task.py

Comment thread src/runtime/pkg/data/data.go Outdated
@codecov

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 36.61972% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.34%. Comparing base (626a626) to head (ecd9b09).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/lib/data/storage/backends/backends.py | 4.54% | 21 Missing ⚠️ |
| src/utils/job/workflow.py | 38.09% | 13 Missing ⚠️ |
| src/cli/dataset.py | 0.00% | 7 Missing ⚠️ |
| src/utils/connectors/postgres.py | 83.33% | 3 Missing ⚠️ |
| src/utils/job/task.py | 66.66% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #865      +/-   ##
==========================================
+ Coverage   44.26%   47.34%   +3.08%     
==========================================
  Files         207      207              
  Lines       27641    29764    +2123     
  Branches     7906     9081    +1175     
==========================================
+ Hits        12234    14093    +1859     
- Misses      15291    15561     +270     
+ Partials      116      110       -6     

| Flag | Coverage Δ |
|------|------------|
| backend | 47.69% <32.39%> (+1.93%) ⬆️ |
| ui | 28.83% <ø> (+0.59%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| src/service/core/data/data_service.py | 15.54% <ø> (+4.70%) ⬆️ |
| src/utils/job/task.py | 68.52% <66.66%> (+12.66%) ⬆️ |
| src/utils/connectors/postgres.py | 80.21% <83.33%> (+5.02%) ⬆️ |
| src/cli/dataset.py | 13.26% <0.00%> (+6.28%) ⬆️ |
| src/utils/job/workflow.py | 62.88% <38.09%> (+12.61%) ⬆️ |
| src/lib/data/storage/backends/backends.py | 62.06% <4.54%> (+4.04%) ⬆️ |

... and 25 files with indirect coverage changes


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 515-524: The ARN conversion logic currently drops path qualifiers
and hard-codes the "aws" partition; update the block that checks
':assumed-role/' and rewrites arn to (1) extract the partition from parts[1]
instead of using "aws", (2) parse parts[5] so you keep the entire role path
(everything between "assumed-role/" and the final session segment) rather than
only the second slash segment, and (3) construct the IAM role ARN as
f'arn:{partition}:iam::{account_id}:role/{role_path}'. Ensure the parsing is
robust by splitting parts[5] on 'assumed-role/' then splitting the remainder on
the last '/' to separate the session name, using the left side as role_path;
keep using the existing variable name arn so downstream code is unchanged.
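
The parsing steps suggested in this comment can be sketched as below. This is an illustration of the reviewer's proposal, not the code that was merged; the function name `normalize_assumed_role_arn` is hypothetical. Whether STS caller ARNs actually carry a role path in practice is not established in this thread, so the path handling should be read as following the comment's assumption.

```python
def normalize_assumed_role_arn(arn: str) -> str:
    """Rewrite an STS assumed-role ARN as an IAM role ARN.

    ARNs that do not look like assumed-role ARNs are returned unchanged.
    Hypothetical helper sketching the review suggestion.
    """
    # arn:<partition>:<service>:<region>:<account_id>:<resource>
    parts = arn.split(':', 5)
    if len(parts) != 6 or not parts[5].startswith('assumed-role/'):
        return arn
    partition, account_id = parts[1], parts[4]
    # Drop the "assumed-role/" prefix, then split off only the final
    # segment (the session name) so any role path is kept intact.
    remainder = parts[5][len('assumed-role/'):]
    role_path, sep, _session = remainder.rpartition('/')
    if not sep or not role_path:
        # No session segment to strip; leave the ARN as it was.
        return arn
    return f'arn:{partition}:iam::{account_id}:role/{role_path}'
```

Taking the partition from the source ARN rather than hard-coding `aws` means ARNs from other partitions keep their partition in the rewritten form.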

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 3fdba626-eb9a-4944-a811-a6e6cc722d88

📥 Commits

Reviewing files that changed from the base of the PR and between 4681116 and 30b6ff1.

📒 Files selected for processing (1)
  • src/lib/data/storage/backends/backends.py

Comment thread src/lib/data/storage/backends/backends.py Outdated

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/data/storage/backends/backends.py (1)

517-526: ⚠️ Potential issue | 🟠 Major

Assumed-role ARN rewrite still drops role paths and hard-codes the partition.

This still breaks path-qualified roles: parts[5].split('/')[1] keeps only the first segment after assumed-role/, and arn:aws:iam::... ignores the source partition. simulate_principal_policy will keep failing for roles like assumed-role/service-role/MyRole/session and for non-aws partitions.

AWS STS assumed-role ARN format for path-qualified roles and required PolicySourceArn format for IAM SimulatePrincipalPolicy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/data/storage/backends/backends.py` around lines 517 - 526, The
rewrite of assumed-role ARNs must preserve the original partition and full role
path (not just a single segment); change the logic that detects ':assumed-role/'
so it extracts partition = parts[1], account_id = parts[4], then strip the
leading "assumed-role/" from parts[5], remove only the trailing session segment
(i.e. split once to get "<role-path>/<session>" then rsplit once to drop the
session) to get the full role path, and rebuild the PolicySourceArn as
f'arn:{partition}:iam::{account_id}:role/{role_path}' (use the existing arn
variable and parts array names to locate and replace the current
transformation).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 488-496: The current branch returns early for
credentials.DefaultDataCredential after calling
_validate_bucket_access(data_cred=...), which only checks bucket-level
reachability and skips access_type-specific object permissions; update this to
either (a) allow the ambient-credential fallback only when access_type ==
AccessType.READ (or equivalent enum) and continue to perform object-level
SimulatePrincipalPolicy checks for WRITE/DELETE, or (b) explicitly reject/raise
for WRITE and DELETE when using DefaultDataCredential with a clear error,
referencing the same symbols (data_cred, credentials.DefaultDataCredential,
_validate_bucket_access, and access_type) so callers cannot bypass object-level
permission validation.
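
Option (a) from this comment can be sketched as a small predicate. The `AccessType` enum below is a stand-in defined only for this example (the comment itself says "or equivalent enum"), and `ambient_check_is_sufficient` is a hypothetical name; neither is taken from backends.py.

```python
import enum


class AccessType(enum.Enum):
    """Stand-in for the project's access type enum (assumed members)."""
    READ = 'READ'
    WRITE = 'WRITE'
    DELETE = 'DELETE'


def ambient_check_is_sufficient(
    is_default_credential: bool,
    access_type: AccessType,
) -> bool:
    """Return True when the bucket-level check alone may end validation.

    Only READ access with an ambient (default) credential takes the early
    return; WRITE and DELETE still need the object-level policy check.
    """
    return is_default_credential and access_type is AccessType.READ
```

A caller would take the early return only when this predicate is true and otherwise continue to the policy simulation, so WRITE and DELETE requests could not skip object-level validation.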

---

Duplicate comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 517-526: The rewrite of assumed-role ARNs must preserve the
original partition and full role path (not just a single segment); change the
logic that detects ':assumed-role/' so it extracts partition = parts[1],
account_id = parts[4], then strip the leading "assumed-role/" from parts[5],
remove only the trailing session segment (i.e. split once to get
"<role-path>/<session>" then rsplit once to drop the session) to get the full
role path, and rebuild the PolicySourceArn as
f'arn:{partition}:iam::{account_id}:role/{role_path}' (use the existing arn
variable and parts array names to locate and replace the current
transformation).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 60009d2a-4b9b-4744-bcc4-6c383d0a71cd

📥 Commits

Reviewing files that changed from the base of the PR and between ad69077 and 38c5c57.

📒 Files selected for processing (4)
  • src/lib/data/storage/backends/backends.py
  • src/utils/connectors/postgres.py
  • src/utils/job/task.py
  • src/utils/job/workflow.py

Comment thread src/lib/data/storage/backends/backends.py
@fernandol-nvidia fernandol-nvidia merged commit 9c4286c into main Apr 16, 2026
23 of 25 checks passed
@fernandol-nvidia fernandol-nvidia deleted the KeitaW/feat/support-default-data-credential branch April 16, 2026 17:27
katriendg added a commit to microsoft/physical-ai-toolchain that referenced this pull request Apr 20, 2026
…cure auth and skrl 2.0.0 compatibility (#492)

## Description

Upgraded the OSMO platform from chart 1.0.1 (image 6.0.0) to chart 1.2.1
(image 6.2) and replaced the broken dev-mode authentication with
**token-based auth** backed by a Key Vault credential lifecycle. The
previous deployment had three compounding failures — backend
CrashLoopBackOff from invalid `loginMethod`, 404 on all auth APIs, and
silently ignored `defaultAdmin` values — making the platform inoperable.

This PR also absorbed the **skrl 2.0.0 breaking API change** (private
`_update` → public `update` with keyword-only arguments) and corrected
several training dependency pins that caused import failures in GPU
containers. End-to-end training validated on both OSMO and AzureML.

Closes #478

## Type of Change

- [x] 🐛 Bug fix (non-breaking change fixing an issue)
- [x] ✨ New feature (non-breaking change adding functionality)
- [ ] 💥 Breaking change (fix or feature causing existing functionality
to change)
- [ ] 📚 Documentation update
- [x] 🏗️ Infrastructure change (Terraform/IaC)
- [ ] ♻️ Refactoring (no functional changes)

## Component(s) Affected

- [ ] `infrastructure/terraform/prerequisites/` - Azure subscription
setup
- [x] `infrastructure/terraform/` - Terraform infrastructure
- [x] `infrastructure/setup/` - OSMO control plane / Helm
- [x] `workflows/` - Training and evaluation workflows
- [x] `training/` - Training pipelines and scripts
- [ ] `docs/` - Documentation

## Testing Performed

- [x] Terraform `plan` reviewed (no unexpected changes)
- [x] Terraform `apply` tested in dev environment
- [ ] Training scripts tested locally with Isaac Sim
- [x] OSMO workflow submitted successfully
- [ ] Smoke tests passed (`smoke_test_azure.py`)

### Validated Training Runs

| Run | Platform | Status | Throughput |
|-----|----------|--------|------------|
| `isaaclab-inline-training-23` | OSMO (workload identity) | COMPLETED | 8.3 it/s |
| `isaaclab-inline-training-24` | OSMO (workload identity + dependabot bumps) | COMPLETED | — |
| `redacted` | AzureML K8s compute | SUBMITTED | — |

### Validation Commands Run

- `ruff check training/ evaluation/` — passed
- `pytest training/ evaluation/` — 43 passed, 92.5% coverage

## Documentation Impact

- [x] No documentation changes needed
- [ ] Documentation updated in this PR
- [ ] Documentation issue filed

## Bug Fix Checklist

- [x] Linked to issue being fixed
- [ ] Regression test included, OR
- [x] Justification for no regression test: OSMO deployment and skrl
training validated through live GPU runs on the target cluster. Unit
tests cover skrl wrapper logic. Infrastructure changes are inherently
integration-tested through the deploy + train cycle.

## Changes

### OSMO 6.2 Platform Upgrade

The Helm chart and image versions were bumped across all four releases
(control plane, backend operator, router, UI). The control plane values
introduced a **`defaultAdmin` block** — NVIDIA's bootstrap mechanism for
non-IdP deployments — and removed `auth.enabled: false`, which was
disabling all user/token CRUD endpoints.

- Updated *defaults.conf* with chart 1.2.1, image 6.2, and CLI version
6.2.10
- Added `defaultAdmin` configuration and Entra IdP placeholder comments
in *osmo-control-plane.yaml*
- Bumped image tag to 6.2 in *osmo-backend-operator.yaml*,
*osmo-router.yaml*, and *osmo-ui.yaml*

### Authentication and Credential Management

> The old `--method dev --username guest` login path was removed in OSMO
6.x. Admin credentials now follow the same Key Vault lifecycle as the
PostgreSQL password: Terraform generates → Key Vault stores → CSI syncs
to K8s secret.

- Added `random_password.osmo_admin` (43-char alphanumeric, URL-safe for
Bearer tokens) and `azapi_resource` for Key Vault storage in
*security.tf*
- Added `osmo-admin-password` to both CSI SecretProviderClass manifests
for automatic K8s secret sync
- Rewrote **`osmo_login_and_setup`** in *common.sh* to use `--method
token --token-file` with process substitution
- Switched *04-deploy-osmo-backend.sh* from raw `curl` POST to `osmo
token set` CLI with 90-day expiry (was 1 year) and `--from-file` secret
creation to avoid credential exposure in process args
- Added **`service_base_url` self-healing** in script 04 — detects when
script 03 couldn't set it (private cluster, no port-forward) and applies
the in-cluster ingress URL so workflow pod sidecars can reach the
control plane

### skrl 2.0.0 API Migration

The upstream skrl library promoted the agent update method from a
private to a public interface with keyword-only arguments.

- Changed protocol definition and wrapper in *skrl_mlflow_agent.py* from
`_update(timestep, timesteps)` to `update(*, timestep, timesteps)`
- Updated monkey-patch assignment in *skrl_training.py*
- Migrated eval mode in *policy_evaluation.py* from
`set_running_mode("eval")` to `enable_training_mode(enabled=False,
apply_to_models=True)` and added required second positional argument to
`act()`
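
The move to a public, keyword-only `update` can be illustrated with a wrapper like the one below. It deliberately does not import skrl: `DummyAgent` and `wrap_update` are hypothetical stand-ins, and the only detail taken from the description above is the `update(*, timestep, timesteps)` signature.

```python
from typing import Callable, List, Tuple


class DummyAgent:
    """Stand-in for an agent exposing a public, keyword-only update method."""

    def __init__(self) -> None:
        self.calls: List[Tuple[int, int]] = []

    def update(self, *, timestep: int, timesteps: int) -> None:
        self.calls.append((timestep, timesteps))


def wrap_update(agent: DummyAgent, on_update: Callable[[int, int], None]) -> None:
    """Monkey-patch agent.update so a callback runs after every update."""
    original = agent.update

    def update(*, timestep: int, timesteps: int) -> None:
        # Forward by keyword; positional calls are rejected by the signature.
        original(timestep=timestep, timesteps=timesteps)
        on_update(timestep, timesteps)

    agent.update = update  # type: ignore[method-assign]
```

Because both the original method and the wrapper are keyword-only, a call written in the old positional style raises `TypeError` immediately instead of silently binding arguments in the wrong order.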

### Training Dependencies

- Pinned **numpy** to `>=1.26.0,<2.0.0` in *pyproject.toml* — Isaac Sim
4.x bundles numpy 1.x-compiled native extensions, and numpy 2.x changes
the C ABI
- Downgraded **marshmallow** from 4.x to 3.25.1 — `azure-ai-ml` requires
`marshmallow>=3.5,<4.0.0`
- Aligned **pydantic** 2.13.1 / **pydantic-core** 2.46.1 — mismatched
versions raise `SystemError` on import
- Applied 9 transitive dependency bumps from dependabot PR #490
- Replaced runtime `uv pip compile` in *train.sh* with `uv pip install
--no-deps` from the pre-locked *requirements.txt*, removing
non-deterministic resolution during GPU-billed training startup
- Made `--max_iterations` conditional in *train.yaml* to prevent
empty-string argparse failures

## Notes

- **Known limitation:** `osmo workflow logs` returns 500 when
`shared_access_key_enabled = false` on the storage account. The OSMO
service internally calls `BlobServiceClient.from_connection_string()`
which does not support `DefaultAzureCredential`. Upstream fix tracked in
[NVIDIA/OSMO PR #865](NVIDIA/OSMO#865).
Workaround: `kubectl logs` for live output, `az storage blob download`
for completed runs. Tracked in #491
- Dataset access-keys template gained an `endpoint` field in
*dataset-config-access-keys.template.json*

## Checklist

- [x] My code follows the [project conventions](copilot-instructions.md)
- [x] Commit messages follow [conventional commit
format](instructions/commit-message.instructions.md)
- [x] I have performed a self-review
- [x] Documentation impact assessed above
- [x] No new linting warnings introduced
3 participants