6.3.0

tdewanNvidia released this 05 May 20:44
· 32 commits to main since this release
07b7140

Highlights

  • ConfigMap-based configuration — All service configs (pools, backends, pod templates, roles, and more) can now be managed as Helm values via a Kubernetes ConfigMap, following standard K8s patterns and enabling GitOps workflows.
  • TLS support — The service chart now terminates TLS at the gateway, with values for cert/key, redirect from HTTP, and SAN configuration.
  • Service chart consolidation — The standalone router and web-ui Helm charts have been folded into the service chart, making a full deployment a single Helm release.
  • Multi-provider deploy scripts: deploy-k8s.sh now provisions OSMO on Azure AKS, AWS EKS, microk8s, or any existing Kubernetes cluster, with idempotent installers for KAI Scheduler, GPU Operator, MinIO, and configurable storage backends (MinIO, Azure Blob, AWS S3, BYO S3).
  • Per-group timeouts: exec_timeout and queue_timeout now meter each group independently instead of running against the workflow as a whole, so a stuck simulation group no longer kills the rest of the workflow.
  • Dataset CLI and API deprecated: osmo dataset commands and the /datasets API endpoints are deprecated and will be removed in 6.4. Migrate to workflow-managed dataset outputs.
  • Rsync download support — Pull files from running workflow tasks to your local machine with osmo workflow rsync download, complementing the existing upload capability.
  • Visual transfer progress — File sync operations now display a progress bar showing bytes transferred, percentage, rate, and ETA.
  • Workload identity for core services — Run OSMO services under a cloud-issued federated identity (Azure Workload Identity on AKS/Arc, AWS IRSA / EKS Pod Identity) via new cloud-neutral serviceAccount annotations and per-component extraPodLabels hooks, removing the need to mount cloud storage keys as Kubernetes Secrets.
  • Privilege escalation fix — Policies with empty resources lists no longer grant access to resource-scoped endpoints.

Breaking Changes

  • Router chart removed: The standalone router Helm chart is gone. Router pods now deploy as part of the service chart. Existing router resources (osmo-router, osmo-router-headless) continue to work, but you must remove the separate router Helm release before upgrading. See the 6.2 to 6.3 upgrade guide for migration steps. (#897)
  • Web UI chart removed: The standalone web-ui Helm chart has been merged into the service chart. Set ui.enabled: true in service values to deploy the UI alongside the API. Remove the separate web-ui release before upgrading. (#907)
  • Squid proxy removed from backend operator: The egress allowlist and squid-proxy sidecar have been removed from the backend operator chart. Network policies now restrict pod-to-pod access directly. (#823)
  • Per-group timeout semantics: exec_timeout and queue_timeout are now enforced per group (clock starts on the group's RUNNING or SCHEDULING transition) instead of per workflow. An expired group is marked FAILED_EXEC_TIMEOUT or FAILED_QUEUE_TIMEOUT; sibling groups continue and the workflow status aggregates only after all groups finish. (#925)
  • Dataset CLI and API deprecated: All osmo dataset subcommands print a stderr deprecation warning, and the /datasets REST endpoints are marked deprecated in the OpenAPI schema. The Datasets page in the UI shows a deprecation banner. Both will be removed in 6.4. (#872)
  • S3 addressing default: For S3-compatible backends with a custom endpoint_url, the addressing style now defaults to virtual-hosted instead of boto3's auto-selection (which picks path style for custom endpoints), fixing compatibility with providers that require virtual hosts. If a backend requires path addressing, set the addressing_style attribute to path, or force OSMO to always use path addressing via the AWS_S3_FORCE_PATH_STYLE environment variable. (#950)
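
The S3 addressing default above can be read as a small precedence rule. The following is a minimal sketch of that rule, not OSMO's actual code: the function name resolve_addressing_style is hypothetical, the accepted spellings for AWS_S3_FORCE_PATH_STYLE are a guess, and the "auto" fallback for backends without a custom endpoint_url is an assumption. The returned strings match the addressing_style values boto3 accepts.

```python
import os


def resolve_addressing_style(endpoint_url=None, addressing_style=None, env=None):
    """Pick an S3 addressing style following the 6.3 default described above."""
    env = os.environ if env is None else env

    # Assumption: the variable is a boolean flag; these spellings are a guess.
    forced = env.get("AWS_S3_FORCE_PATH_STYLE", "").strip().lower()
    if forced in {"1", "true", "yes"}:
        return "path"

    # An explicit per-backend addressing_style attribute overrides the default.
    if addressing_style:
        return addressing_style

    # New default for custom endpoints: virtual-hosted rather than boto3's
    # auto-selection, which would pick path style here.
    if endpoint_url:
        return "virtual"

    # Assumption: without a custom endpoint, selection is left to boto3.
    return "auto"
```

Under these assumptions, a backend that only works with path addressing keeps working after the upgrade by setting addressing_style to path on that backend, without affecting others.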

Helm Charts

  • ConfigMap configuration mode: Set services.configs.enabled: true to manage all service configs via Helm values. While this mode is active, CLI and API writes return HTTP 409. The chart ships with default roles, pod templates, resource validations, backend, and pool. (#822)

  • ConfigMap mode for worker, agent, and logger: The ConfigMapWatcher now runs in the worker, agent, and logger services. Previously only the API service watched the ConfigMap, so workflow pods built by the worker could be constructed from stale config. (#926)

  • TLS termination at the gateway: Configure a serving cert/key, optional HTTP-to-HTTPS redirect, and SAN list via gateway.tls. The gateway template generates the matching Envoy listener config. (#953)

  • Cloud workload identity: New top-level serviceAccount block (create, name, annotations) and per-component extraPodLabels on agent, api, worker, logger, router, and delayedJobMonitor. The hooks are cloud-neutral — set the annotations and labels your CSP's identity webhook expects:

    • Azure (AKS / Arc): annotate the SA with azure.workload.identity/client-id: <uami-client-id> and label pods with azure.workload.identity/use: "true". The Azure storage backend falls back to DefaultAzureCredential when no static connection string is supplied.
    • AWS (EKS IRSA / Pod Identity): annotate the SA with eks.amazonaws.com/role-arn: <iam-role-arn>. The S3 backend picks up the federated token from boto3's default credential chain — no pod labels required.

  • Gateway consolidation: A unified gateway now handles load balancing for all service types (API, router, UI), simplifying ingress configuration. (#817, #799)

  • Gateway extension hooks: Inject custom Envoy filters and additive auth-skip paths via gateway.envoy.extensions and gateway.envoy.authSkipPaths, useful for sidecar integrations and bypassing authz on specific endpoints. (#1009)

  • Default identity headers: Minimal deployments can now inject default x-osmo-user, x-osmo-roles, and x-osmo-allowed-pools headers for unauthenticated browser requests via gateway.envoy.defaultIdentity values. (#902)

  • oauth2-proxy extraEnv: Expose environment variables on the oauth2-proxy container via gateway.oauth2Proxy.extraEnv, needed for Redis AUTH when using session storage. (#898)

  • Custom HPA metrics: Specify custom metrics for Horizontal Pod Autoscalers on service components. (#858)

  • Pool computed fields resolved at load time: ConfigMap pools no longer require pre-expanded parsed_pod_template and parsed_resource_validations, reducing config file size by ~60%. (#866)

  • Per-field Secret mounts: Create credential Secrets with kubectl --from-literal instead of packaging all fields into a single cred.yaml. (#884)

  • Default pod templates on default pool: The chart's default pool now sets common_pod_template, so workflows submitted without an explicit template pick up default_ctrl and default_user automatically. (#1010, #1012)

  • Backend-operator startup probe configurable: startupProbe thresholds on the backend listener and worker are now exposed in values, with relaxed defaults to handle slow image pulls on cold clusters. (#961)

  • Service startup probe extended: The API service startupProbe failure threshold now allows up to ~2 minutes for migrations and DB warm-up before the pod is restarted. (#967)

  • podMonitor disabled by default: Both the service and backend-operator charts now default podMonitor.enabled to false, avoiding errors on clusters without Prometheus Operator CRDs installed. (#962, #963)

  • Config export script: New deployments/upgrades/export_configs_to_helm.py exports existing database configs to Helm values format. (#866)
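
Several of the values keys named in these notes are nested. As an illustration only, the sketch below assembles a values document from dotted keys and prints it as JSON, which is also valid YAML. The helper set_path is hypothetical; only keys spelled out in these notes are used, and the placeholder <iam-role-arn> is left unfilled on purpose.

```python
import json


def set_path(values, dotted_key, value):
    """Set a nested key such as 'services.configs.enabled' inside a dict."""
    node = values
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return values


values = {}
set_path(values, "services.configs.enabled", True)  # ConfigMap configuration mode
set_path(values, "ui.enabled", True)                # UI deployed by the service chart
set_path(values, "podMonitor.enabled", False)       # matches the new default
set_path(values, "serviceAccount.create", True)
# The annotation key contains dots, so it is passed as a dict, not as a dotted path.
set_path(values, "serviceAccount.annotations",
         {"eks.amazonaws.com/role-arn": "<iam-role-arn>"})

print(json.dumps(values, indent=2))
```

The chart's real schema has many more keys (gateway.tls, per-component extraPodLabels, and so on); check the chart's values file for their exact layout before relying on a shape like this.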

Deployment Scripts

  • Multi-provider deploy: deploy-k8s.sh provisions a Kubernetes cluster on Azure AKS, AWS EKS, microk8s, or registers an existing cluster, then installs OSMO end-to-end. Cluster-agnostic dependency installers detect existing KAI Scheduler, GPU Operator, and MinIO so re-runs are safe. (#979)
  • Storage backend wiring: configure-storage.sh provisions and registers the workflow storage backend for MinIO, Azure Blob, AWS S3, or a bring-your-own S3 endpoint, including credential creation and bucket setup. (#979, #988)
  • Idempotent token mint: Backend operator token reconciliation now deletes any pre-existing backend-token before re-minting, so partial prior runs and microk8s PVC carryover no longer wedge re-deploys. (#988)
  • Helm values for minimal install: deploy-osmo-minimal.sh accepts --values to layer custom Helm values on top of the minimal preset. (#993)

Workflow Execution

  • Per-group exec and queue timeouts: Each group's clock starts on its own RUNNING (exec) or SCHEDULING (queue) transition. Expired groups are marked FAILED_EXEC_TIMEOUT or FAILED_QUEUE_TIMEOUT; downstream groups cascade as FAILED_UPSTREAM, while sibling groups keep running. Delayed jobs serialized before the upgrade fall back to the previous workflow-level enforcement with a warning log. (#925)
  • Pool quota accounting handles Jinja: osmo-ctrl resource requests and limits are now pre-rendered for pool-quota accounting, so templated values like {% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %} are counted correctly instead of being silently treated as zero. (#931)
  • service_auth wired into worker, agent, logger: These services now read service_auth and stop reading service_base_url from the database, fixing config-mode authentication for non-API pods. (#930)
  • KAI queues sync on every registration: Backend registration now syncs KAI Scheduler queues unconditionally, instead of only on the first registration. (#941)
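
The per-group timeout behavior can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the scheduler's implementation: the Group class, the apply_timeouts function, the "WAITING" status, and the group names are all hypothetical, and groups are assumed to arrive in dependency order. Only RUNNING, SCHEDULING, FAILED_EXEC_TIMEOUT, FAILED_QUEUE_TIMEOUT, and FAILED_UPSTREAM come from the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    status: str          # "SCHEDULING", "RUNNING", or a hypothetical "WAITING"
    entered_at: float    # when the group entered `status`, in seconds
    upstream: list = field(default_factory=list)


def apply_timeouts(groups, now, exec_timeout, queue_timeout):
    """Return {name: status}; `groups` is assumed to be in dependency order."""
    result = {}
    for g in groups:
        elapsed = now - g.entered_at  # each group has its own clock
        if any(result.get(u, "").startswith("FAILED") for u in g.upstream):
            result[g.name] = "FAILED_UPSTREAM"       # cascade from a failed upstream
        elif g.status == "RUNNING" and elapsed > exec_timeout:
            result[g.name] = "FAILED_EXEC_TIMEOUT"
        elif g.status == "SCHEDULING" and elapsed > queue_timeout:
            result[g.name] = "FAILED_QUEUE_TIMEOUT"
        else:
            result[g.name] = g.status                # siblings are left alone
    return result


groups = [
    Group("simulate", "RUNNING", entered_at=0),
    Group("train", "RUNNING", entered_at=90),
    Group("evaluate", "WAITING", entered_at=0, upstream=["simulate"]),
]
statuses = apply_timeouts(groups, now=100, exec_timeout=60, queue_timeout=30)
```

In this example the simulate group has run for 100 seconds against a 60-second exec_timeout and expires, evaluate cascades because it depends on simulate, and train keeps running because its own clock started only 10 seconds ago.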

CLI

  • Rsync download: Pull files from running tasks to your local machine with osmo workflow rsync download wf-id /remote/path:/local/path. (#792)
  • Rsync errors when remote source is missing: Downloads now fail loudly when the requested remote path doesn't exist on the task pod. Previously the rsync daemon exited 0 with zero files transferred and the CLI reported success while leaving the destination empty. (#1019)
  • Rsync shutdown error fixed: The spurious ValueError: Invalid file descriptor: -1 after a successful rsync download is gone. (#987)
  • Transfer progress bar: Rsync upload and download now display an in-place progress bar showing bytes, percentage, rate, and ETA. Suppress with --no-progress. (#826)
  • Structured JSON logs: Pass --log_format json (or set the equivalent values key on services) to emit single-line JSON logs compatible with Fluent Bit. (#888)
  • Uninstall script: Remove the OSMO CLI with osmo-uninstall (macOS/Linux). (#710)
  • Agent skill prompt: The installer offers to install the OSMO agent skill for AI coding assistants during CLI installation. (#841)
  • Token expiry warning: The CLI warns when your access token is within 24 hours of expiring. (#711)
  • Token roles nargs: osmo token set --roles now accepts multiple roles as separate arguments instead of requiring a comma-separated list. (#754)
  • Dataset commands deprecated: osmo dataset * subcommands print a stderr deprecation warning. The commands and corresponding /datasets API will be removed in 6.4. (#872)
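
For readers curious how the four progress fields relate, here is a minimal sketch of a progress line built from bytes done, total bytes, and elapsed time. It is not the CLI's renderer: format_progress and its output layout are hypothetical, and the rate is a simple average over the whole transfer.

```python
def format_progress(done, total, elapsed_s, width=20):
    """Build one progress line: bar, bytes, percentage, rate, and ETA."""
    fraction = done / total if total else 1.0
    filled = int(fraction * width)
    bar = "#" * filled + "-" * (width - filled)

    # Average rate so far; ETA is the remaining bytes divided by that rate.
    rate = done / elapsed_s if elapsed_s > 0 else 0.0
    eta_text = f"{(total - done) / rate:.0f}s" if rate > 0 else "--"

    return f"[{bar}] {done}/{total} B {fraction:.1%} {rate:.0f} B/s ETA {eta_text}"


line = format_progress(50, 200, 5.0, width=8)
```

An in-place display would typically reprint such a line with a leading carriage return on each update instead of emitting a new line.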

Web UI

  • Workflow version navigation: Navigate between workflow run versions using back/forward arrows in the details panel. (#834)
  • Task failure messages: Failed and canceled tasks now display their failure_message in the Details section, even when exit code is null. (#832, #833)
  • Sign out via oauth2-proxy: The Sign Out action now routes through the oauth2-proxy logout endpoint so the upstream session is cleared, not just the local cookie. (#996)
  • Exec cookies fix for multi-router deployments: Exec session cookies are now scoped correctly when multiple router services are running, so terminal sessions stay attached to the right backend. (#1003)
  • Datasets deprecation banner: The /datasets page shows a deprecation banner announcing the v6.4 removal. (#872)
  • Terminal resize: The web shell now responds to window resizes, fixing display issues with applications like vim. (#727, #717)
  • Filter and retry fixes: Resolved issues with workflow log filters, task retry display, and occupancy search fields. (#784)
  • Next.js 16.2.4 / Node 24.14.1: The UI now ships on Next.js 16.2.4 and Node 24.14.1. (#949)
  • CodeMirror deduplicated: @codemirror/state is now deduped to a single version, fixing intermittent editor crashes. (#955)

Authorization

  • Privilege escalation fix: Policies with an empty resources list now correctly match only unscoped endpoints. Previously, they incorrectly granted access to all resource-scoped paths (e.g., a user with auth:Token and no resources could manage other users' tokens). (#867)
  • Default role sync: A default role is now created before authz sidecar sync, preventing startup failures when no roles exist. (#791)
  • Default pool submission: Users with the osmo-user role can submit workflows to the default pool without explicit pool assignment. (#728)
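
The privilege escalation fix comes down to how an empty resources list is interpreted. The sketch below contrasts the old and new behavior in a toy matcher. It is an illustration, not the authz sidecar's logic: policy_allows, the policy dict shape, the exact-match comparison, and the resource string "token/alice" are hypothetical, and the handling of unscoped endpoints for policies that do list resources is an assumption.

```python
def policy_allows(policy, action, resource=None, fixed=True):
    """Toy policy check; `fixed=False` reproduces the pre-6.3 behavior."""
    if action not in policy["actions"]:
        return False

    if resource is None:
        # Unscoped endpoint: an action match is enough (assumed for all policies).
        return True

    resources = policy.get("resources", [])
    if not resources:
        # Empty list on a resource-scoped endpoint:
        # before the fix this granted access, after the fix it does not.
        return not fixed

    # Simplification: exact match instead of whatever patterns the real matcher supports.
    return resource in resources


policy = {"actions": ["auth:Token"], "resources": []}
before = policy_allows(policy, "auth:Token", "token/alice", fixed=False)
after = policy_allows(policy, "auth:Token", "token/alice", fixed=True)
```

Here before evaluates to True and after to False, which mirrors the example of a user with auth:Token and no resources no longer being able to manage other users' tokens.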

Configuration Validation

  • Fail-fast startup: Pods now crash-loop on a malformed ConfigMap at startup instead of silently falling back to database mode. K8s rolling updates stall at the bad revision while healthy pods continue serving. (#889)
  • K8s Events on reload failure: When hot-reload validation fails, the loader emits a ConfigMapReloadFailed Warning event on the ConfigMap, visible via kubectl describe configmap and cluster monitoring. (#889)
  • Secret redaction in errors: Pydantic validation errors no longer echo secret values. Error messages show only the field path and error type. (#891)
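
One way to picture the redaction is a formatter that keeps the location and error type and drops everything else. This is a sketch under assumptions, not the loader's code: summarize_errors is a hypothetical name, and the input dicts are modeled on the shape Pydantic v2 returns from ValidationError.errors() (type, loc, msg, input), built by hand here so the example needs no third-party package.

```python
def summarize_errors(errors):
    """Format validation errors as 'field.path: error_type', omitting input values."""
    lines = []
    for err in errors:
        path = ".".join(str(part) for part in err["loc"])
        # Deliberately skip err["input"] and err["msg"]; either may contain a secret.
        lines.append(f"{path}: {err['type']}")
    return lines


# Hand-built error in the Pydantic v2 shape; the field names are hypothetical.
errors = [
    {
        "type": "string_type",
        "loc": ("backends", 0, "access_key"),
        "msg": "Input should be a valid string",
        "input": 12345,
    }
]
summary = summarize_errors(errors)
```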

Performance

  • Concurrent workflow file uploads: Workflow file uploads now run concurrently, reducing submit latency for workflows with many files. (#783)
  • Batch group and task inserts: Workflow submit now inserts all groups and tasks in a single atomic transaction instead of per-group calls. (#821)
  • Bulk query optimization: Workflow.fetch_from_db() now issues 3 queries regardless of group count, down from 2N+2 for N groups. (#820)
  • Barrier notification batching: _notify_barrier uses a single SQL query and Redis pipeline instead of N individual fetches. (#756)
  • UpdateGroup performance: Task status aggregation uses a lightweight query instead of loading full task rows. (#742)
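
The bulk query change follows a common pattern: fetch each table once for the whole workflow and assemble the tree in memory, rather than querying per group. The sketch below shows that pattern with sqlite3. The schema, table names, and fetch_workflow function are hypothetical and are not OSMO's database layout.

```python
import sqlite3


def fetch_workflow(conn, workflow_id):
    """Load a workflow, its groups, and their tasks using three queries in total."""
    wf = conn.execute(
        "SELECT id, name FROM workflows WHERE id = ?", (workflow_id,)
    ).fetchone()
    group_rows = conn.execute(
        "SELECT id, name FROM wf_groups WHERE workflow_id = ? ORDER BY id",
        (workflow_id,),
    ).fetchall()
    task_rows = conn.execute(
        "SELECT group_id, name FROM wf_tasks WHERE workflow_id = ? ORDER BY id",
        (workflow_id,),
    ).fetchall()

    # Stitch tasks onto their groups in memory instead of one query per group.
    groups = {gid: {"name": name, "tasks": []} for gid, name in group_rows}
    for group_id, task_name in task_rows:
        groups[group_id]["tasks"].append(task_name)
    return {"id": wf[0], "name": wf[1], "groups": list(groups.values())}


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workflows (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE wf_groups (id INTEGER PRIMARY KEY, workflow_id INTEGER, name TEXT);
    CREATE TABLE wf_tasks (id INTEGER PRIMARY KEY, workflow_id INTEGER,
                           group_id INTEGER, name TEXT);
    INSERT INTO workflows VALUES (1, 'demo');
    INSERT INTO wf_groups VALUES (10, 1, 'simulate'), (11, 1, 'train');
    INSERT INTO wf_tasks VALUES (100, 1, 10, 'sim-0'), (101, 1, 10, 'sim-1'),
                                (102, 1, 11, 'train-0');
""")
workflow = fetch_workflow(conn, 1)
```

The query count stays at three whether the workflow has two groups or two hundred, which is the property the release note describes.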

Bug Fixes

  • Credential env var collision: Multiple credentials with the same payload key name (e.g., both use key) no longer overwrite each other. Secret data keys are now namespaced with the credential name. (#839)
  • Credential names not masked: Credential names and field references (e.g., AWS_ACCESS_KEY_ID) are no longer incorrectly masked in workflow specs. (#744)
  • Dataset manifest sort: Fixed a binary search mismatch in the dataset manifest comparator. (#903)
  • Dataset browsing from private buckets: Dataset URLs for S3-compatible backends are now built against the credential's override_url instead of the AWS pattern, so the UI can fetch content from CAIOS, MinIO, and other non-AWS endpoints. (#957)
  • Storage credential setup errors: Clearer error messages when required fields are missing or malformed during credential creation. (#947)
  • OpenAPI schema generation: API schema export works again after the Pydantic v2 migration. (#985)
  • SSL truststore on Python 3.14 + microk8s: Patched ssl with truststore so HTTPS calls from in-cluster pods on Python 3.14 + microk8s pick up the system trust store. (#951)
  • Web UI base image (CVE-2026-2673): Bumped the web-ui base image to v4.0.5 to pick up the upstream fix. (#971)
  • Workflow file 403 handling: The streaming response now returns a proper error when workflow file access is forbidden. (#730)
  • Authz path fixes: Corrected authorization paths for rsync, workflow exec, and credential create operations. (#739, #738, #737)
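
The credential collision fix is easy to see with two credentials that both use a payload field named key. The sketch below flattens credentials into a single Secret-style mapping with and without namespacing. It is illustrative only: build_secret_data, the credential names, and the "." separator are assumptions, since the notes do not state the exact key format.

```python
def build_secret_data(credentials, namespaced=True):
    """Flatten {credential name: payload} into one mapping of Secret data keys."""
    data = {}
    for cred_name, payload in credentials.items():
        for key, value in payload.items():
            # Assumption: the credential name is joined to the field with a dot.
            data_key = f"{cred_name}.{key}" if namespaced else key
            data[data_key] = value
    return data


credentials = {
    "s3-prod": {"key": "value-a"},
    "registry": {"key": "value-b"},
}
old_style = build_secret_data(credentials, namespaced=False)
new_style = build_secret_data(credentials, namespaced=True)
```

Without namespacing the second credential silently overwrites the first, leaving a single key entry; with namespacing both values survive under distinct keys.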

Getting OSMO

Helm Charts and Containers

Helm charts and container images are available on NGC.

CLI Client

Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.