Highlights
- ConfigMap-based configuration: All service configs (pools, backends, pod templates, roles, and more) can now be managed as Helm values via a Kubernetes ConfigMap, following standard K8s patterns and enabling GitOps workflows.
- TLS support: The service chart now terminates TLS at the gateway, with values for cert/key, redirect from HTTP, and SAN configuration.
- Service chart consolidation: The standalone `router` and `web-ui` Helm charts have been folded into the `service` chart, making a full deployment a single Helm release.
- Multi-provider deploy scripts: `deploy-k8s.sh` now provisions OSMO on Azure AKS, AWS EKS, microk8s, or any existing Kubernetes cluster, with idempotent installers for KAI Scheduler, GPU Operator, MinIO, and configurable storage backends (MinIO, Azure Blob, AWS S3, BYO S3).
- Per-group timeouts: `exec_timeout` and `queue_timeout` now meter each group independently instead of running against the workflow as a whole, so a stuck simulation group no longer kills the rest of the workflow.
- Dataset CLI and API deprecated: `osmo dataset` commands and the `/datasets` API endpoints are deprecated and will be removed in 6.4. Migrate to workflow-managed dataset outputs.
- Rsync download support: Pull files from running workflow tasks to your local machine with `osmo workflow rsync download`, complementing the existing upload capability.
- Visual transfer progress: File sync operations now display a progress bar showing bytes transferred, percentage, rate, and ETA.
- Workload identity for core services: Run OSMO services under a cloud-issued federated identity (Azure Workload Identity on AKS/Arc, AWS IRSA / EKS Pod Identity) via new cloud-neutral `serviceAccount` annotations and per-component `extraPodLabels` hooks, removing the need to mount cloud storage keys as Kubernetes Secrets.
- Privilege escalation fix: Policies with empty `resources` lists no longer grant access to resource-scoped endpoints.
Breaking Changes
- Router chart removed: The standalone `router` Helm chart is gone. Router pods now deploy as part of the `service` chart. Existing router resources (`osmo-router`, `osmo-router-headless`) continue to work, but you must remove the separate router Helm release before upgrading. See the 6.2 to 6.3 upgrade guide for migration steps. (#897)
- Web UI chart removed: The standalone `web-ui` Helm chart has been merged into the `service` chart. Set `ui.enabled: true` in service values to deploy the UI alongside the API. Remove the separate `web-ui` release before upgrading. (#907)
- Squid proxy removed from backend operator: The egress allowlist and squid-proxy sidecar have been removed from the backend operator chart. Network policies now restrict pod-to-pod access directly. (#823)
- Per-group timeout semantics: `exec_timeout` and `queue_timeout` are now enforced per group (clock starts on the group's `RUNNING` or `SCHEDULING` transition) instead of per workflow. An expired group is marked `FAILED_EXEC_TIMEOUT` or `FAILED_QUEUE_TIMEOUT`; sibling groups continue and the workflow status aggregates only after all groups finish. (#925)
- Dataset CLI and API deprecated: All `osmo dataset` subcommands print a stderr deprecation warning, and the `/datasets` REST endpoints are marked deprecated in the OpenAPI schema. The Datasets page in the UI shows a deprecation banner. Both will be removed in 6.4. (#872)
- S3 addressing default: For S3-compatible backends with a custom `endpoint_url`, the addressing style now defaults to virtual-hosted instead of boto3's auto-selection (which picks path style for custom endpoints), fixing compatibility with providers that require virtual hosts. If a backend requires path addressing, set the `addressing_style` attribute to `path`, or force OSMO to always use path addressing via the `AWS_S3_FORCE_PATH_STYLE` environment variable. (#950)
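
The S3 addressing change can be pictured as a small resolution function. This is a minimal sketch and not OSMO's implementation: the function name `resolve_addressing_style`, the set of values treated as "on" for `AWS_S3_FORCE_PATH_STYLE`, and the precedence between the environment variable and the backend attribute are assumptions. Only `endpoint_url`, `addressing_style`, and the variable name come from the notes above.

```python
from typing import Mapping, Optional

# Hypothetical: the release notes do not say which values enable the override.
_TRUTHY = {"1", "true", "yes"}


def resolve_addressing_style(
    endpoint_url: Optional[str],
    addressing_style: Optional[str],
    env: Mapping[str, str],
) -> str:
    """Pick an S3 addressing style for one backend (illustrative only)."""
    # Assumption: the environment variable wins, since it is described as
    # forcing path addressing everywhere.
    if env.get("AWS_S3_FORCE_PATH_STYLE", "").strip().lower() in _TRUTHY:
        return "path"
    # An explicit per-backend attribute overrides the default.
    if addressing_style:
        return addressing_style
    # 6.3 default for custom endpoints: virtual-hosted addressing.
    if endpoint_url:
        return "virtual"
    # No custom endpoint: leave the choice to boto3's auto-selection.
    return "auto"


print(resolve_addressing_style("https://s3.example.com", None, {}))
print(resolve_addressing_style("https://s3.example.com", "path", {}))
```

The returned string uses botocore's own vocabulary (`auto`, `virtual`, `path`), so it could be passed as `Config(s3={"addressing_style": ...})` when building a client.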
Helm Charts
- ConfigMap configuration mode: Set `services.configs.enabled: true` to manage all service configs via Helm values. CLI/API writes return HTTP 409 when active. The chart ships with default roles, pod templates, resource validations, backend, and pool. (#822)
- ConfigMap mode for worker, agent, and logger: The ConfigMapWatcher now runs in the worker, agent, and logger services. Previously only the API service watched the ConfigMap, so workflow pods built by the worker could be constructed from stale config. (#926)
- TLS termination at the gateway: Configure a serving cert/key, optional HTTP-to-HTTPS redirect, and SAN list via `gateway.tls`. The gateway template generates the matching Envoy listener config. (#953)
- Cloud workload identity: New top-level `serviceAccount` block (`create`, `name`, `annotations`) and per-component `extraPodLabels` on `agent`, `api`, `worker`, `logger`, `router`, and `delayedJobMonitor`. The hooks are cloud-neutral; set the annotations and labels your CSP's identity webhook expects:
  - Azure (AKS / Arc): annotate the SA with `azure.workload.identity/client-id: <uami-client-id>` and label pods with `azure.workload.identity/use: "true"`. The Azure storage backend falls back to `DefaultAzureCredential` when no static connection string is supplied.
  - AWS (EKS IRSA / Pod Identity): annotate the SA with `eks.amazonaws.com/role-arn: <iam-role-arn>`. The S3 backend picks up the federated token from boto3's default credential chain; no pod labels are required.
- Gateway consolidation: A unified gateway now handles load balancing for all service types (API, router, UI), simplifying ingress configuration. (#817, #799)
- Gateway extension hooks: Inject custom Envoy filters and additive auth-skip paths via `gateway.envoy.extensions` and `gateway.envoy.authSkipPaths`, useful for sidecar integrations and bypassing authz on specific endpoints. (#1009)
- Default identity headers: Minimal deployments can now inject default `x-osmo-user`, `x-osmo-roles`, and `x-osmo-allowed-pools` headers for unauthenticated browser requests via `gateway.envoy.defaultIdentity` values. (#902)
- oauth2-proxy extraEnv: Expose environment variables on the oauth2-proxy container via `gateway.oauth2Proxy.extraEnv`, needed for Redis AUTH when using session storage. (#898)
- Custom HPA metrics: Specify custom metrics for Horizontal Pod Autoscalers on service components. (#858)
- Pool computed fields resolved at load time: ConfigMap pools no longer require pre-expanded `parsed_pod_template` and `parsed_resource_validations`, reducing config file size by ~60%. (#866)
- Per-field Secret mounts: Create credential Secrets with `kubectl --from-literal` instead of packaging all fields into a single `cred.yaml`. (#884)
- Default pod templates on default pool: The chart's default pool now sets `common_pod_template`, so workflows submitted without an explicit template pick up `default_ctrl` and `default_user` automatically. (#1010, #1012)
- Backend-operator startup probe configurable: `startupProbe` thresholds on the backend listener and worker are now exposed in values, with relaxed defaults to handle slow image pulls on cold clusters. (#961)
- Service startup probe extended: The API service `startupProbe` failure threshold now allows up to ~2 minutes for migrations and DB warm-up before the pod is restarted. (#967)
- podMonitor disabled by default: Both the service and backend-operator charts now default `podMonitor.enabled` to `false`, avoiding errors on clusters without Prometheus Operator CRDs installed. (#962, #963)
- Config export script: New `deployments/upgrades/export_configs_to_helm.py` exports existing database configs to Helm values format. (#866)
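
For pipelines that generate values programmatically, the workload identity hooks can be assembled as below. This is a sketch under stated assumptions: it builds an overlay for the Azure case and prints JSON, which Helm accepts as a values file because JSON is valid YAML. The keys `serviceAccount`, `annotations`, and `extraPodLabels` and the component names come from the notes above. The placement of each component at the top level of the values tree, the service account name `osmo`, and the helper name are hypothetical and should be checked against the chart's `values.yaml`.

```python
import json
from typing import Any, Dict, List

# Component names as listed in the release notes.
COMPONENTS: List[str] = ["agent", "api", "worker", "logger", "router", "delayedJobMonitor"]


def azure_workload_identity_values(client_id: str) -> Dict[str, Any]:
    """Build a values overlay for Azure Workload Identity (illustrative only)."""
    values: Dict[str, Any] = {
        "serviceAccount": {
            "create": True,
            "name": "osmo",  # hypothetical service account name
            "annotations": {"azure.workload.identity/client-id": client_id},
        }
    }
    # Assumption: each component is a top-level key in the values tree.
    for component in COMPONENTS:
        values[component] = {
            "extraPodLabels": {"azure.workload.identity/use": "true"}
        }
    return values


print(json.dumps(azure_workload_identity_values("<uami-client-id>"), indent=2))
```

For AWS, the same shape would carry only the `eks.amazonaws.com/role-arn` annotation, since the notes state that no pod labels are required there.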
Deployment Scripts
- Multi-provider deploy: `deploy-k8s.sh` provisions a Kubernetes cluster on Azure AKS, AWS EKS, or microk8s, or registers an existing cluster, then installs OSMO end-to-end. Cluster-agnostic dependency installers detect existing KAI Scheduler, GPU Operator, and MinIO so re-runs are safe. (#979)
- Storage backend wiring: `configure-storage.sh` provisions and registers the workflow storage backend for MinIO, Azure Blob, AWS S3, or a bring-your-own S3 endpoint, including credential creation and bucket setup. (#979, #988)
- Idempotent token mint: Backend operator token reconciliation now deletes any pre-existing `backend-token` before re-minting, so partial prior runs and microk8s PVC carryover no longer wedge re-deploys. (#988)
- Helm values for minimal install: `deploy-osmo-minimal.sh` accepts `--values` to layer custom Helm values on top of the minimal preset. (#993)
Workflow Execution
- Per-group exec and queue timeouts: Each group's clock starts on its own `RUNNING` (exec) or `SCHEDULING` (queue) transition. Expired groups are marked `FAILED_EXEC_TIMEOUT` or `FAILED_QUEUE_TIMEOUT`; downstream groups cascade as `FAILED_UPSTREAM`, while sibling groups keep running. Delayed jobs serialized before the upgrade fall back to the previous workflow-level enforcement with a warning log. (#925)
- Pool quota accounting handles Jinja: `osmo-ctrl` resource requests and limits are now pre-rendered for pool-quota accounting, so templated values like `{% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %}` are counted correctly instead of being silently treated as zero. (#931)
- `service_auth` wired into worker, agent, logger: These services now read `service_auth` and stop reading `service_base_url` from the database, fixing config-mode authentication for non-API pods. (#930)
- KAI queues sync on every registration: Backend registration now syncs KAI Scheduler queues unconditionally, instead of only on the first registration. (#941)
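
The per-group timeout semantics can be illustrated with a small model. This is a simplified sketch, not OSMO's scheduler: it covers only the exec clock (the queue clock would be analogous, starting at `SCHEDULING`), and the `Group` class, the `evaluate` function, and the `PENDING` and `COMPLETED` status names are hypothetical. `RUNNING`, `FAILED_EXEC_TIMEOUT`, and `FAILED_UPSTREAM` come from the notes above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Group:
    name: str
    exec_timeout: float                   # seconds the group may spend executing
    running_since: Optional[float]        # when the group entered RUNNING, None if not yet
    finished_at: Optional[float] = None   # when the group finished, None if still going
    upstream: List[str] = field(default_factory=list)


def evaluate(groups: List[Group], now: float) -> Dict[str, str]:
    """Return one status per group. Groups must be listed upstream-first."""
    status: Dict[str, str] = {}
    for group in groups:
        if any(status[name].startswith("FAILED") for name in group.upstream):
            # A failed upstream group cascades to its dependents.
            status[group.name] = "FAILED_UPSTREAM"
        elif group.running_since is None:
            status[group.name] = "PENDING"  # hypothetical status name
        else:
            # Each group's clock starts at its own RUNNING transition.
            end = group.finished_at if group.finished_at is not None else now
            if end - group.running_since > group.exec_timeout:
                status[group.name] = "FAILED_EXEC_TIMEOUT"
            elif group.finished_at is not None:
                status[group.name] = "COMPLETED"  # hypothetical status name
            else:
                status[group.name] = "RUNNING"
    return status


groups = [
    Group("sim", exec_timeout=60, running_since=0),
    Group("train", exec_timeout=600, running_since=50),
    Group("report", exec_timeout=60, running_since=None, upstream=["sim"]),
]
print(evaluate(groups, now=100))
```

In this example `sim` exceeds its own 60-second budget and `report` fails as its dependent, but the sibling `train` group keeps running because its clock is independent.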
CLI
- Rsync download: Pull files from running tasks to your local machine with `osmo workflow rsync download wf-id /remote/path:/local/path`. (#792)
- Rsync errors when remote source is missing: Downloads now fail loudly when the requested remote path doesn't exist on the task pod. Previously the rsync daemon exited 0 with zero files transferred and the CLI reported success while leaving the destination empty. (#1019)
- Rsync shutdown error fixed: The spurious `ValueError: Invalid file descriptor: -1` after a successful rsync download is gone. (#987)
- Transfer progress bar: Rsync upload and download now display an in-place progress bar showing bytes, percentage, rate, and ETA. Suppress with `--no-progress`. (#826)
- Structured JSON logs: Pass `--log_format json` (or set the equivalent values key on services) to emit single-line JSON logs compatible with Fluent Bit. (#888)
- Uninstall script: Remove the OSMO CLI with `osmo-uninstall` (macOS/Linux). (#710)
- Agent skill prompt: The installer offers to install the OSMO agent skill for AI coding assistants during CLI installation. (#841)
- Token expiry warning: The CLI warns when your access token is within 24 hours of expiring. (#711)
- Token roles nargs: `osmo token set --roles` now accepts multiple roles as separate arguments instead of requiring a comma-separated list. (#754)
- Dataset commands deprecated: `osmo dataset *` subcommands print a stderr deprecation warning. The commands and corresponding `/datasets` API will be removed in 6.4. (#872)
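
To show what single-line JSON logging looks like in practice, here is a generic formatter built on Python's standard `logging` module. It is a sketch of the general technique, not the formatter OSMO ships: the class name `JsonFormatter` and the field names (`time`, `level`, `logger`, `message`, `exception`) are assumptions, and the actual schema emitted by `--log_format json` may differ.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # json.dumps escapes embedded newlines, so the output stays on one line.
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("multi-line\nmessage stays on one line")
```

Keeping each record on one line lets a line-oriented collector parse every entry as a standalone JSON document.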
Web UI
- Workflow version navigation: Navigate between workflow run versions using back/forward arrows in the details panel. (#834)
- Task failure messages: Failed and canceled tasks now display their `failure_message` in the Details section, even when exit code is null. (#832, #833)
- Sign out via oauth2-proxy: The Sign Out action now routes through the oauth2-proxy logout endpoint so the upstream session is cleared, not just the local cookie. (#996)
- Exec cookies fix for multi-router deployments: Exec session cookies are now scoped correctly when multiple router services are running, so terminal sessions stay attached to the right backend. (#1003)
- Datasets deprecation banner: The `/datasets` page shows a deprecation banner announcing the v6.4 removal. (#872)
- Terminal resize: The web shell now responds to window resizes, fixing display issues with applications like vim. (#727, #717)
- Filter and retry fixes: Resolved issues with workflow log filters, task retry display, and occupancy search fields. (#784)
- Next.js 16.2.4 / Node 24.14.1: The UI now ships on Next.js 16.2.4 and Node 24.14.1. (#949)
- CodeMirror deduplicated: `@codemirror/state` is now deduped to a single version, fixing intermittent editor crashes. (#955)
Authorization
- Privilege escalation fix: Policies with an empty `resources` list now correctly match only unscoped endpoints. Previously, they incorrectly granted access to all resource-scoped paths (e.g., a user with `auth:Token` and no resources could manage other users' tokens). (#867)
- Default role sync: A default role is now created before authz sidecar sync, preventing startup failures when no roles exist. (#791)
- Default pool submission: Users with the `osmo-user` role can submit workflows to the default pool without explicit pool assignment. (#728)
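
The corrected matching rule for empty `resources` lists can be sketched as follows. This is an illustration of the rule as stated above, not the authz sidecar's code: the `Policy` class, the `allows` function, the resource identifiers such as `token/alice`, exact-string resource matching, and the handling of unscoped requests by policies that do list resources are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Policy:
    action: str
    resources: List[str] = field(default_factory=list)


def allows(policy: Policy, action: str, resource: Optional[str]) -> bool:
    """Return True if the policy permits the action on the (optional) resource."""
    if policy.action != action:
        return False
    if resource is None:
        # Unscoped endpoint. Assumption: matching the action is sufficient.
        return True
    # Resource-scoped endpoint: the resource must be listed explicitly.
    # With an empty list this is always False, which is the fixed behavior;
    # before the fix, an empty list matched every resource-scoped path.
    return resource in policy.resources


unscoped_only = Policy("auth:Token")
print(allows(unscoped_only, "auth:Token", None))
print(allows(unscoped_only, "auth:Token", "token/alice"))
```

Treating "no resources" as "no resource-scoped access" makes the safe outcome the default when a policy author omits the list.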
Configuration Validation
- Fail-fast startup: Pods crash-loop on a malformed ConfigMap at startup instead of silently falling back to database mode. K8s rolling updates stall at the bad revision while healthy pods continue serving. (#889)
- K8s Events on reload failure: When hot-reload validation fails, the loader emits a `ConfigMapReloadFailed` Warning event on the ConfigMap, visible via `kubectl describe configmap` and cluster monitoring. (#889)
- Secret redaction in errors: Pydantic validation errors no longer echo secret values. Error messages show field path and type only. (#891)
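
One way to get path-and-type-only error messages is to format the error records without ever touching the offending input. The sketch below works on plain dictionaries shaped like the entries returned by Pydantic v2's `ValidationError.errors()` (keys `type`, `loc`, `msg`, `input`); the function name `redact_errors`, the output format, and the sample field names are assumptions, and OSMO's actual implementation may differ.

```python
from typing import Any, Dict, Iterable, List


def redact_errors(errors: Iterable[Dict[str, Any]]) -> List[str]:
    """Format validation errors as 'field.path: error_type', dropping inputs."""
    lines: List[str] = []
    for err in errors:
        path = ".".join(str(part) for part in err.get("loc", ()))
        # Deliberately ignore err["input"] and err["msg"], which may echo secrets.
        lines.append(f"{path}: {err.get('type', 'unknown')}")
    return lines


# Hypothetical error record: a list was supplied where a string was expected.
sample = [
    {
        "type": "string_type",
        "loc": ("backends", 0, "secret_key"),
        "msg": "Input should be a valid string",
        "input": ["hunter2"],
    }
]
print(redact_errors(sample))
```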
Performance
- Concurrent workflow file uploads: Workflow file uploads now run concurrently, reducing submit latency for workflows with many files. (#783)
- Batch group and task inserts: Workflow submit now inserts all groups and tasks in a single atomic transaction instead of per-group calls. (#821)
- Bulk query optimization: `Workflow.fetch_from_db()` reduced from 2N+2 queries to 3 queries regardless of group count. (#820)
- Barrier notification batching: `_notify_barrier` uses a single SQL query and Redis pipeline instead of N individual fetches. (#756)
- UpdateGroup performance: Task status aggregation uses a lightweight query instead of loading full task rows. (#742)
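
The bulk-fetch pattern behind a constant query count can be sketched with SQLite. This is a generic illustration, not the code in `Workflow.fetch_from_db()`: the table and column names (`workflow`, `workflow_group`, `workflow_task`) and the function `fetch_workflow` are hypothetical. The point is that groups and tasks are each loaded with one query for the whole workflow and then joined in memory, so the count stays at three however many groups exist.

```python
import sqlite3
from collections import defaultdict
from typing import Any, Dict, List


def fetch_workflow(conn: sqlite3.Connection, workflow_id: str) -> Dict[str, Any]:
    # Query 1: the workflow row itself.
    row = conn.execute(
        "SELECT id, status FROM workflow WHERE id = ?", (workflow_id,)
    ).fetchone()
    if row is None:
        raise KeyError(workflow_id)
    # Query 2: every group of the workflow in one round trip.
    group_rows = conn.execute(
        "SELECT name FROM workflow_group WHERE workflow_id = ? ORDER BY name",
        (workflow_id,),
    ).fetchall()
    # Query 3: every task of the workflow, bucketed by group in memory.
    tasks_by_group: Dict[str, List[str]] = defaultdict(list)
    for group_name, task_name in conn.execute(
        "SELECT group_name, name FROM workflow_task WHERE workflow_id = ? ORDER BY name",
        (workflow_id,),
    ):
        tasks_by_group[group_name].append(task_name)
    groups = {name: tasks_by_group[name] for (name,) in group_rows}
    return {"id": row[0], "status": row[1], "groups": groups}


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE workflow (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE workflow_group (workflow_id TEXT, name TEXT);
    CREATE TABLE workflow_task (workflow_id TEXT, group_name TEXT, name TEXT);
    INSERT INTO workflow VALUES ('wf-1', 'RUNNING');
    INSERT INTO workflow_group VALUES ('wf-1', 'sim'), ('wf-1', 'train');
    INSERT INTO workflow_task VALUES
        ('wf-1', 'sim', 'sim-0'), ('wf-1', 'sim', 'sim-1'), ('wf-1', 'train', 'train-0');
    """
)
print(fetch_workflow(conn, "wf-1"))
```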
Bug Fixes
- Credential env var collision: Multiple credentials with the same payload key name (e.g., both use `key`) no longer overwrite each other. Secret data keys are now namespaced with the credential name. (#839)
- Credential names not masked: Credential names and field references (e.g., `AWS_ACCESS_KEY_ID`) are no longer incorrectly masked in workflow specs. (#744)
- Dataset manifest sort: Fixed binary search mismatch in dataset manifest comparator. (#903)
- Dataset browsing from private buckets: Dataset URLs for S3-compatible backends are now built against the credential's `override_url` instead of the AWS pattern, so the UI can fetch content from CAIOS, MinIO, and other non-AWS endpoints. (#957)
- Storage credential setup errors: Clearer error messages when required fields are missing or malformed during credential creation. (#947)
- OpenAPI schema generation: API schema export works again after the Pydantic v2 migration. (#985)
- SSL truststore on Python 3.14 + microk8s: Patched `ssl` with `truststore` so HTTPS calls from in-cluster pods on Python 3.14 + microk8s pick up the system trust store. (#951)
- Web UI base image (CVE-2026-2673): Bumped the web-ui base image to v4.0.5 to pick up the upstream fix. (#971)
- Workflow file 403 handling: The streaming response now returns a proper error when workflow file access is forbidden. (#730)
- Authz path fixes: Corrected authorization paths for rsync, workflow exec, and credential create operations. (#739, #738, #737)
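
The credential key collision and its fix can be illustrated with a short example. This is a sketch of the namespacing idea only: the function name `namespaced_secret_data`, the credential names, and the `<credential>.<key>` separator are assumptions, since the notes say the keys are namespaced with the credential name but do not give the exact format.

```python
from typing import Dict, Mapping


def namespaced_secret_data(credentials: Mapping[str, Mapping[str, str]]) -> Dict[str, str]:
    """Flatten credentials into Secret data keys prefixed by credential name."""
    data: Dict[str, str] = {}
    for cred_name, payload in credentials.items():
        for key, value in payload.items():
            # Hypothetical key format: "<credential>.<key>".
            data[f"{cred_name}.{key}"] = value
    return data


# Two hypothetical credentials that both use the payload key "key".
creds = {
    "registry": {"key": "registry-value"},
    "storage": {"key": "storage-value"},
}
print(namespaced_secret_data(creds))
```

Without the prefix, both payloads would map to the same data key `key` and the later one would silently replace the earlier one.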
Getting OSMO
Helm Charts and Containers
Helm charts and container images are available on NGC.
CLI Client
Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.