Add AI Agentic Skills #555

Merged
ethany-nv merged 1 commit into main from ethany/osmo-skill
Feb 27, 2026

Conversation

@ethany-nv
Collaborator

Description

This PR adds an OSMO skill for Claude Code that enables natural-language interaction with the OSMO platform directly from the terminal.

What it does

The skill gives Claude contextual knowledge of the OSMO CLI, allowing it to autonomously run commands and interpret results on behalf of the user. It covers five core use cases:

  • Check available resources — Lists accessible GPU pools and computes true effective availability (accounting for both quota limits and physical capacity). Automatically surfaces LOW priority submission opportunities when quota is
    exhausted but GPUs are physically idle.
  • Generate and submit workflows — Generates valid OSMO workflow YAML from a natural-language description, referencing a built-in cookbook of 25+ real-world examples (SDG, RL training, GR00T fine-tuning, Cosmos, ROS2, remote dev
    environments, and more). Handles resource validation errors automatically by adjusting CPU/memory/storage to node capacity before resubmitting.
  • List workflows — Displays recent workflow submissions with status, pool, and duration in a clean table with clear status symbols.
  • Check workflow status and logs — Queries live status and fetches logs for running or completed workflows, with smart per-task log fetching for multi-task jobs. Offers to download output datasets on completion and translates
    Kubernetes scheduling events into plain language when a job is PENDING.
  • Explain what a workflow does — Fetches the original workflow spec and summarizes it in plain language, covering what it runs, how it's configured, and what it produces.
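
To illustrate the "effective availability" idea from the first use case, here is a minimal Python sketch. It takes the smaller of the remaining quota and the physically idle GPU count, and it flags a LOW priority opportunity when quota is exhausted but GPUs are idle. The `PoolUsage` fields and both function names are hypothetical, chosen for this example only; they are not taken from the OSMO CLI or from the skill's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolUsage:
    """Hypothetical per-pool GPU counts (field names are illustrative, not OSMO's)."""
    name: str
    quota_limit: int  # GPUs the user's quota allows in this pool
    quota_used: int   # GPUs already counted against that quota
    total_gpus: int   # GPUs physically present in the pool
    busy_gpus: int    # GPUs currently allocated to any workload


def quota_free(pool: PoolUsage) -> int:
    """GPUs still allowed by quota, never negative."""
    return max(pool.quota_limit - pool.quota_used, 0)


def physically_free(pool: PoolUsage) -> int:
    """GPUs that are physically idle, never negative."""
    return max(pool.total_gpus - pool.busy_gpus, 0)


def effective_available(pool: PoolUsage) -> int:
    """A request must fit under both limits, so take the smaller one."""
    return min(quota_free(pool), physically_free(pool))


def low_priority_opportunity(pool: PoolUsage) -> int:
    """Idle GPUs reachable only outside quota (assumed LOW priority case).

    Returns 0 unless the quota is fully used while hardware sits idle.
    """
    if quota_free(pool) == 0:
        return physically_free(pool)
    return 0
```

Under these assumptions, a pool with a quota of 8 GPUs fully used and 6 of 16 GPUs idle has an effective availability of 0 but a LOW priority opportunity of 6, which is the situation the skill is described as surfacing.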

Reference documentation

The skill includes three reference files Claude consults when needed:

  • cookbook.md — Index of 25+ ready-to-use workflow YAML examples from the NVIDIA/OSMO GitHub repo
  • workflow-patterns.md — Patterns for multi-task, parallel, pipelined, and Jinja-templated workflows
  • advanced-patterns.md — Checkpointing, exit/retry behavior, and node exclusion

Issue #532

References:

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

ethany-nv requested a review from a team as a code owner February 27, 2026 01:36
ethany-nv self-assigned this Feb 27, 2026
ethany-nv merged commit c4effe6 into main Feb 27, 2026
19 checks passed
ethany-nv deleted the ethany/osmo-skill branch February 27, 2026 02:12
elookpotts-nvidia added a commit that referenced this pull request Feb 27, 2026
* fix: remove role_arn from router cloudwatch log agent config (#430)

* fix: move backports-tarfile comment to its own line to prevent it from being embedded in wheel metadata (#541)

* Enable Workflow Events in CLI (#533)

* Enable Workflow Events in CLI

* Remove error events from workflow events CLI subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove last_n_lines argument from workflow events subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove old UI (#535)

* Remove //ui from github action

* Remove //ui

* Remove unneeded Node from Bazel

* Address follow ups during the //src/ui move (#536)

* Update to pnpm dev

* Give build/push scripts default arguments

* Add instructions

* Update BUILD_AND_TEST

* buildProductionCsp only if we are indeed a production build

* Add navigation progress bar for slow connections (#540)

* Refine Log Viewer (#539)

* Trim extra space at beginning of log

* Log-viewer expand/collapse row

* Clean up envoy response headers (#519)

* Fix resource table listing (#534)

* Fix resource table listing

* lint

* Lint CSS (#542)

* Lint CSS

* Fix css linting

* fix: default database schema, and redeployment schema drop (#544)

* update default database schema

* fix issues with re-deploy

* update schema version variable to be same across charts

* Allow setting _osmo_session cookies for local -> prod development (#548)

* Allow setting _osmo_session cookies for local -> prod development

* Format

* gh-pages: Switch to action deploy (#551)

For the repo size reduction #543 switch to using actions based
deployment of our github pages.

The pr-preview functionality will return in a later PR.

* Add AI Agentic Skills (#555)

* fix: upgrade fastapi to 0.125.0 to resolve starlette CVE (#556)

Upgrade fastapi from 0.115.5 to 0.125.0, which allows starlette
to resolve from 0.41.3 to 0.50.0, fixing GHSA-7f5h-v6ap-rcq8.

FastAPI 0.125.0 is the latest version that still supports pydantic v1,
maintaining compatibility with the existing pydantic==1.10.13 pin.

* Fix OTel collector memory pressure after SDK upgrade (#558)

Increase collector memory_limiter from 30 MiB to 128 MiB and reduce
metric export frequency from 6s to 15s to prevent data drops under
high-cardinality metric load.

---------

Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Hans Arnholm <harnholm@nvidia.com>
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
RyaliNvidia added a commit that referenced this pull request Feb 28, 2026
* fix: remove role_arn from router cloudwatch log agent config (#430)

* fix: move backports-tarfile comment to its own line to prevent it from being embedded in wheel metadata (#541)

* Enable Workflow Events in CLI (#533)

* Enable Workflow Events in CLI

* Remove error events from workflow events CLI subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove last_n_lines argument from workflow events subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove old UI (#535)

* Remove //ui from github action

* Remove //ui

* Remove unneeded Node from Bazel

* Address follow ups during the //src/ui move (#536)

* Update to pnpm dev

* Give build/push scripts default arguments

* Add instructions

* Update BUILD_AND_TEST

* buildProductionCsp only if we are indeed a production build

* Add navigation progress bar for slow connections (#540)

* Refine Log Viewer (#539)

* Trim extra space at beginning of log

* Log-viewer expand/collapse row

* Clean up envoy response headers (#519)

* Fix resource table listing (#534)

* Fix resource table listing

* lint

* Lint CSS (#542)

* Lint CSS

* Fix css linting

* fix: default database schema, and redeployment schema drop (#544)

* update default database schema

* fix issues with re-deploy

* update schema version variable to be same across charts

* Allow setting _osmo_session cookies for local -> prod development (#548)

* Allow setting _osmo_session cookies for local -> prod development

* Format

* gh-pages: Switch to action deploy (#551)

For the repo size reduction #543 switch to using actions based
deployment of our github pages.

The pr-preview functionality will return in a later PR.

* Add AI Agentic Skills (#555)

* fix: upgrade fastapi to 0.125.0 to resolve starlette CVE (#556)

Upgrade fastapi from 0.115.5 to 0.125.0, which allows starlette
to resolve from 0.41.3 to 0.50.0, fixing GHSA-7f5h-v6ap-rcq8.

FastAPI 0.125.0 is the latest version that still supports pydantic v1,
maintaining compatibility with the existing pydantic==1.10.13 pin.

* Fix OTel collector memory pressure after SDK upgrade (#558)

Increase collector memory_limiter from 30 MiB to 128 MiB and reduce
metric export frequency from 6s to 15s to prevent data drops under
high-cardinality metric load.

* fix migration hook lifecycle (#560)

* Upgrade UI codegen tooling (Orval v7 -> v8) and regenerate (#557)

* Cleanup backend_todos 14, 17

* Fix backend_todo #3

* Orval v7 -> v8 migration + regenerate autogen code

* Format

* Use stronger types

* Remove unused import

* Fix osmo_barrier.py bug with num_nodes=1 (#561)

Fix bug where osmo_barrier.py hangs when num_nodes=1

* Fix oauth2-proxy TOML parse error when using Kubernetes secrets (#563)

* Tweak Cancel/Resubmit to gracefully handle related/unrelated errors (#564)

* Tweak Cancel/Resubmit to gracefully handle related/unrelated errors

* Add back refresh button into cancel toast

* Stabilize UI CI (#567)

* Add helm upgrade validation and remove deprecated values (#568)

* Add helm upgrade validation for 6.0 → 6.2 breaking changes

* Remove deprecated oauth2Filter and secretPaths from chart defaults

* feat: datasets collections, file browser overhaul, and mock fidelity improvements (#569)

* fix: use push history for file browser path navigation to enable back button

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: surface S3 URI through DatasetFile type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use basePath-aware proxy, expand text type support, and copy S3 path in file preview

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: replace double-click with always-visible leading open-panel button in datasets table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add Browse files button and clickable version rows to dataset details panel

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move copy button to fixed leading column in file browser, always visible, copy S3 path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: replace version switcher dropdown with prev/next nav and Details panel on file browser page

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: add Home > Datasets prefix links to file browser breadcrumb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: collapse deep file browser breadcrumb paths with ellipsis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: sibling folder popover on breadcrumb segment click

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add MidTruncate component and apply to dataset and file name columns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: format ternary in sibling popover button

* refactor: move open-details button inline in name cell, right-aligned with tooltip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: hoist dataset details panel to layout level

The details slideout is now mounted once at the /datasets/** route
layout, so it persists across navigation between the list and the file
browser pages. Clicking "Browse files" or a version in the panel no
longer closes and reopens the panel.

Changes:
- Add datasets-panel-store.ts: ephemeral Zustand store (bucket/name/isOpen)
- Add datasets-panel-context.tsx: passes isPanelOpen/openPanel/closePanel to pages
- Add datasets-panel-layout.tsx: ResizablePanel + DatasetPanel at layout level
- Add src/app/(dashboard)/datasets/layout.tsx: Next.js route layout
- DatasetsPageContent: remove local ResizablePanel, use store + context
- DatasetDetailContent: remove outer ResizablePanel, use context for Details toggle
- DatasetPanel: revert accidental ?details=true navigation param

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move copy-path button inline in file browser name cell

* fix: breadcrumb last segment truncation and copy tooltip confirmation

* feat: merge dataset file browser header into chrome nav

* feat: add onFocusedRowChange callback and j/k/l vim bindings to DataTable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add keyboard navigation to dataset file browser

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: autoplay and loop video in file preview panel

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add CollectionMember type and discriminated DetailResponse to datasets adapter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add Type column to datasets list table with Collection badge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add CollectionPanelMembers and update DatasetPanel to handle collections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: generalize VersionSwitcher and FileBrowserControls to generic SwitcherItem[]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: render collection members as top-level entries in the file browser

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add collection mock data and interleaved list/info handlers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add mock file manifest for dataset file browser in dev:mock

* fix: route location-files through MSW in mock mode via impl split pattern

* fix: use text/json files in mock manifest so preview panel can render them

* feat: add private dataset 401 and file-proxy MSW interception in mock mode

* fix: isolate copy-path tooltip state per button and add s3 storage_path to mock manifests

- PreviewError now owns its own useCopy() instance so clicking its "Copy path"
  button doesn't also trigger the header copy tooltip
- generateFlatManifest accepts optional locationBase and populates storage_path
  on every RawFileItem so the Copy button appears in the file browser table
- Pass locationUrl from the location-files MSW handler into generateFlatManifest
- Replace http.head + http.get file-proxy handlers with http.all to fix HEAD
  interception failure through the mock port-9999 tunnel

* style: format server-mock-utils.ts

* fix: remove copy path button from error states in file preview panel

* fix: match file browser table header height and border to preview panel header

* style: format data-table and file-preview-panel

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix pool quota api logic and add tests (#570)

* Add RBAC to docs + move keycloak to optional (#562)

* Remove keycloak from docs

* update pat

* lint

* lint

* spell

* Update docs for rbac

* lint

* clean

* update grid;

* update doc

* update names

* remove

* update

* Add keycloak in the appendix

* remove

* Update docs/deployment_guide/appendix/authentication/identity_provider_setup.rst

Co-authored-by: Vivian Pan <vivianp@nvidia.com>

* comments

* more comments

---------

Co-authored-by: Vivian Pan <vivianp@nvidia.com>

---------

Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Hans Arnholm <harnholm@nvidia.com>
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
Co-authored-by: ecolternv <ecolter@nvidia.com>

3 participants