Support DefaultDataCredential in Service (Workflow Data, Logs, Apps)#865
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughWidened credential abstractions to DataCredential across connectors and jobs, made MountURL avoid exporting empty AWS creds and region, normalized STS assumed-role ARNs to IAM role ARNs for IAM policy simulation in S3 auth, added unit tests for DefaultDataCredential serialization and config behavior. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant S3 as S3 (head/list)
participant STS as STS
participant IAM as IAM (simulate_principal_policy)
Client->>S3: data_auth with DefaultDataCredential -> perform head_bucket/list_buckets
alt DefaultDataCredential handles ambient auth
S3-->>Client: bucket access validated
else Non-default credential
Client->>STS: GetCallerIdentity (via session)
STS-->>Client: caller_arn
Note right of Client: if ARN contains :assumed-role/\nconvert to arn:aws:iam::<acct>:role/<role>
Client->>IAM: simulate_principal_policy(PolicySourceArn=normalized_arn,...)
IAM-->>Client: simulation result
Client->>S3: proceed based on simulation result
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/runtime/pkg/data/data.go`:
- Around line 421-427: The current mount credential handling in data.go (using
dataCredential) only sets AWS env vars when AccessKeyId/Region are non-empty but
never clears them, and AWS_SECRET_ACCESS_KEY is gated by AccessKeyId which can
leave stale credentials in-process; update the logic in the mount setup that
reads dataCredential so that for each field you either set the corresponding env
var when non-empty or explicitly unset it when empty (at minimum clear
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN and AWS_REGION per
mount), and fix the gating so AWS_SECRET_ACCESS_KEY is checked/cleared based on
dataCredential.AccessKey (or AccessKeyId as appropriate) rather than only
AccessKeyId to avoid carrying over stale static credentials between mounts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5c8a5954-c34e-4ef7-932e-1e828ad1519a
📒 Files selected for processing (6)
src/lib/data/storage/backends/tests/test_backends.pysrc/runtime/pkg/data/data.gosrc/utils/connectors/postgres.pysrc/utils/job/task.pysrc/utils/job/tests/BUILDsrc/utils/job/tests/test_task.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #865 +/- ##
==========================================
+ Coverage 44.26% 47.34% +3.08%
==========================================
Files 207 207
Lines 27641 29764 +2123
Branches 7906 9081 +1175
==========================================
+ Hits 12234 14093 +1859
- Misses 15291 15561 +270
+ Partials 116 110 -6
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 515-524: The ARN conversion logic currently drops path qualifiers
and hard-codes the "aws" partition; update the block that checks
':assumed-role/' and rewrites arn to (1) extract the partition from parts[1]
instead of using "aws", (2) parse parts[5] so you keep the entire role path
(everything between "assumed-role/" and the final session segment) rather than
only the second slash segment, and (3) construct the IAM role ARN as
f'arn:{partition}:iam::{account_id}:role/{role_path}'. Ensure the parsing is
robust by splitting parts[5] on 'assumed-role/' then splitting the remainder on
the last '/' to separate the session name, using the left side as role_path;
keep using the existing variable name arn so downstream code is unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 3fdba626-eb9a-4944-a811-a6e6cc722d88
📒 Files selected for processing (1)
src/lib/data/storage/backends/backends.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/data/storage/backends/backends.py (1)
517-526:⚠️ Potential issue | 🟠 MajorAssumed-role ARN rewrite still drops role paths and hard-codes the partition.
This still breaks path-qualified roles:
parts[5].split('/')[1]keeps only the first segment afterassumed-role/, andarn:aws:iam::...ignores the source partition.simulate_principal_policywill keep failing for roles likeassumed-role/service-role/MyRole/sessionand for non-awspartitions.AWS STS assumed-role ARN format for path-qualified roles and required PolicySourceArn format for IAM SimulatePrincipalPolicy🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/data/storage/backends/backends.py` around lines 517 - 526, The rewrite of assumed-role ARNs must preserve the original partition and full role path (not just a single segment); change the logic that detects ':assumed-role/' so it extracts partition = parts[1], account_id = parts[4], then strip the leading "assumed-role/" from parts[5], remove only the trailing session segment (i.e. split once to get "<role-path>/<session>" then rsplit once to drop the session) to get the full role path, and rebuild the PolicySourceArn as f'arn:{partition}:iam::{account_id}:role/{role_path}' (use the existing arn variable and parts array names to locate and replace the current transformation).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 488-496: The current branch returns early for
credentials.DefaultDataCredential after calling
_validate_bucket_access(data_cred=...), which only checks bucket-level
reachability and skips access_type-specific object permissions; update this to
either (a) allow the ambient-credential fallback only when access_type ==
AccessType.READ (or equivalent enum) and continue to perform object-level
SimulatePrincipalPolicy checks for WRITE/DELETE, or (b) explicitly reject/raise
for WRITE and DELETE when using DefaultDataCredential with a clear error,
referencing the same symbols (data_cred, credentials.DefaultDataCredential,
_validate_bucket_access, and access_type) so callers cannot bypass object-level
permission validation.
---
Duplicate comments:
In `@src/lib/data/storage/backends/backends.py`:
- Around line 517-526: The rewrite of assumed-role ARNs must preserve the
original partition and full role path (not just a single segment); change the
logic that detects ':assumed-role/' so it extracts partition = parts[1],
account_id = parts[4], then strip the leading "assumed-role/" from parts[5],
remove only the trailing session segment (i.e. split once to get
"<role-path>/<session>" then rsplit once to drop the session) to get the full
role path, and rebuild the PolicySourceArn as
f'arn:{partition}:iam::{account_id}:role/{role_path}' (use the existing arn
variable and parts array names to locate and replace the current
transformation).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 60009d2a-4b9b-4744-bcc4-6c383d0a71cd
📒 Files selected for processing (4)
src/lib/data/storage/backends/backends.pysrc/utils/connectors/postgres.pysrc/utils/job/task.pysrc/utils/job/workflow.py
…cure auth and skrl 2.0.0 compatibility (#492) ## Description Upgraded the OSMO platform from chart 1.0.1 (image 6.0.0) to chart 1.2.1 (image 6.2) and replaced the broken dev-mode authentication with **token-based auth** backed by a Key Vault credential lifecycle. The previous deployment had three compounding failures — backend CrashLoopBackOff from invalid `loginMethod`, 404 on all auth APIs, and silently ignored `defaultAdmin` values — making the platform inoperable. This PR also absorbed the **skrl 2.0.0 breaking API change** (private `_update` → public `update` with keyword-only arguments) and corrected several training dependency pins that caused import failures in GPU containers. End-to-end training validated on both OSMO and AzureML. Closes #478 ## Type of Change - [x] 🐛 Bug fix (non-breaking change fixing an issue) - [x] ✨ New feature (non-breaking change adding functionality) - [ ] 💥 Breaking change (fix or feature causing existing functionality to change) - [ ] 📚 Documentation update - [x] 🏗️ Infrastructure change (Terraform/IaC) - [ ] ♻️ Refactoring (no functional changes) ## Component(s) Affected - [ ] `infrastructure/terraform/prerequisites/` - Azure subscription setup - [x] `infrastructure/terraform/` - Terraform infrastructure - [x] `infrastructure/setup/` - OSMO control plane / Helm - [x] `workflows/` - Training and evaluation workflows - [x] `training/` - Training pipelines and scripts - [ ] `docs/` - Documentation ## Testing Performed - [x] Terraform `plan` reviewed (no unexpected changes) - [x] Terraform `apply` tested in dev environment - [ ] Training scripts tested locally with Isaac Sim - [x] OSMO workflow submitted successfully - [ ] Smoke tests passed (`smoke_test_azure.py`) ### Validated Training Runs | Run | Platform | Status | Throughput | |-----|----------|--------|------------| | `isaaclab-inline-training-23` | OSMO (workload identity) | COMPLETED | 8.3 it/s | | `isaaclab-inline-training-24` | OSMO (workload identity + dependabot bumps) | COMPLETED | — | | `redacted` | AzureML K8s compute | SUBMITTED | — | ### Validation Commands Run - `ruff check training/ evaluation/` — passed - `pytest training/ evaluation/` — 43 passed, 92.5% coverage ## Documentation Impact - [x] No documentation changes needed - [ ] Documentation updated in this PR - [ ] Documentation issue filed ## Bug Fix Checklist - [x] Linked to issue being fixed - [ ] Regression test included, OR - [x] Justification for no regression test: OSMO deployment and skrl training validated through live GPU runs on the target cluster. Unit tests cover skrl wrapper logic. Infrastructure changes are inherently integration-tested through the deploy + train cycle. ## Changes ### OSMO 6.2 Platform Upgrade The Helm chart and image versions were bumped across all four releases (control plane, backend operator, router, UI). The control plane values introduced a **`defaultAdmin` block** — NVIDIA's bootstrap mechanism for non-IdP deployments — and removed `auth.enabled: false`, which was disabling all user/token CRUD endpoints. - Updated *defaults.conf* with chart 1.2.1, image 6.2, and CLI version 6.2.10 - Added `defaultAdmin` configuration and Entra IdP placeholder comments in *osmo-control-plane.yaml* - Bumped image tag to 6.2 in *osmo-backend-operator.yaml*, *osmo-router.yaml*, and *osmo-ui.yaml* ### Authentication and Credential Management > The old `--method dev --username guest` login path was removed in OSMO 6.x. Admin credentials now follow the same Key Vault lifecycle as the PostgreSQL password: Terraform generates → Key Vault stores → CSI syncs to K8s secret. - Added `random_password.osmo_admin` (43-char alphanumeric, URL-safe for Bearer tokens) and `azapi_resource` for Key Vault storage in *security.tf* - Added `osmo-admin-password` to both CSI SecretProviderClass manifests for automatic K8s secret sync - Rewrote **`osmo_login_and_setup`** in *common.sh* to use `--method token --token-file` with process substitution - Switched *04-deploy-osmo-backend.sh* from raw `curl` POST to `osmo token set` CLI with 90-day expiry (was 1 year) and `--from-file` secret creation to avoid credential exposure in process args - Added **`service_base_url` self-healing** in script 04 — detects when script 03 couldn't set it (private cluster, no port-forward) and applies the in-cluster ingress URL so workflow pod sidecars can reach the control plane ### skrl 2.0.0 API Migration The upstream skrl library promoted the agent update method from a private to a public interface with keyword-only arguments. - Changed protocol definition and wrapper in *skrl_mlflow_agent.py* from `_update(timestep, timesteps)` to `update(*, timestep, timesteps)` - Updated monkey-patch assignment in *skrl_training.py* - Migrated eval mode in *policy_evaluation.py* from `set_running_mode("eval")` to `enable_training_mode(enabled=False, apply_to_models=True)` and added required second positional argument to `act()` ### Training Dependencies - Pinned **numpy** to `>=1.26.0,<2.0.0` in *pyproject.toml* — Isaac Sim 4.x bundles numpy 1.x-compiled native extensions, and numpy 2.x changes the C ABI - Downgraded **marshmallow** from 4.x to 3.25.1 — `azure-ai-ml` requires `marshmallow>=3.5,<4.0.0` - Aligned **pydantic** 2.13.1 / **pydantic-core** 2.46.1 — mismatched versions raise `SystemError` on import - Applied 9 transitive dependency bumps from dependabot PR #490 - Replaced runtime `uv pip compile` in *train.sh* with `uv pip install --no-deps` from the pre-locked *requirements.txt*, removing non-deterministic resolution during GPU-billed training startup - Made `--max_iterations` conditional in *train.yaml* to prevent empty-string argparse failures ## Notes - **Known limitation:** `osmo workflow logs` returns 500 when `shared_access_key_enabled = false` on the storage account. The OSMO service internally calls `BlobServiceClient.from_connection_string()` which does not support `DefaultAzureCredential`. Upstream fix tracked in [NVIDIA/OSMO PR #865](NVIDIA/OSMO#865). Workaround: `kubectl logs` for live output, `az storage blob download` for completed runs. Tracked in #491 - Dataset access-keys template gained an `endpoint` field in *dataset-config-access-keys.template.json* ## Checklist - [x] My code follows the [project conventions](copilot-instructions.md) - [x] Commit messages follow [conventional commit format](instructions/commit-message.instructions.md) - [x] I have performed a self-review - [x] Documentation impact assessed above - [x] No new linting warnings introduced
Description
Ported changed from #751 for upcoming release
Issue - None
Checklist
Summary by CodeRabbit
Bug Fixes
New Features
Tests