(feat) rvs: implement SOT artifact caching pipeline and HTTP cache server#1813

Open
maxo-nv wants to merge 1 commit into NVIDIA:main from maxo-nv:feature/rvs_sot_artifact_rebase0

Contributor

@maxo-nv maxo-nv commented May 19, 2026

Description

This patch adds the artifact pre-caching pipeline described in #416.
Before each validation cycle, RVS now fetches the SOT JSON, resolves
artifact URLs (direct URIs and JSONPath-based sotpath expressions),
and downloads them concurrently into a local cache directory. An HTTP
file server then serves the cache to nodes during validation. The
server starts once before the main loop so it stays alive across
cycles; new files written by each download pass become visible
immediately without a restart.
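
One way to get the "visible without a restart" behaviour is to resolve each request against the cache directory at request time instead of indexing it at startup. The sketch below illustrates that idea only. The `lookup` helper is hypothetical, and it assumes the directory layout under the cache root mirrors the URL path, which the patch does not state.

```rust
use std::path::{Component, Path, PathBuf};

/// Maps a request path such as `/gb200nvl/1.2.2/os` to a file under
/// `cache_dir`. Returns `None` if the path tries to leave the cache or the
/// file does not exist. Hypothetical helper, not taken from the patch.
fn lookup(cache_dir: &Path, request_path: &str) -> Option<PathBuf> {
    let rel = Path::new(request_path.trim_start_matches('/'));
    // Accept only plain segments: no `..`, no leading `.`, no absolute parts.
    let plain = rel.components().all(|c| matches!(c, Component::Normal(_)));
    if rel.as_os_str().is_empty() || !plain {
        return None;
    }
    let full = cache_dir.join(rel);
    // The filesystem is consulted on every call, so a file written by a
    // later download pass is found without restarting anything.
    full.is_file().then_some(full)
}

fn main() -> std::io::Result<()> {
    let cache = std::env::temp_dir().join("rvs-cache-sketch");
    let _ = std::fs::remove_dir_all(&cache); // start from a clean directory
    let dir = cache.join("gb200nvl").join("1.2.2");
    std::fs::create_dir_all(&dir)?;

    println!("before download: {:?}", lookup(&cache, "/gb200nvl/1.2.2/os"));
    std::fs::write(dir.join("os"), b"image")?; // stand-in for a download pass
    println!("after download:  {:?}", lookup(&cache, "/gb200nvl/1.2.2/os"));
    println!("traversal:       {:?}", lookup(&cache, "/../etc/passwd"));
    Ok(())
}
```

Rejecting anything other than normal path segments keeps a request from reaching files outside the cache directory, which matters for any server exposed to other nodes.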

Downloads are bounded by a configurable semaphore, respect a per-file
timeout, and verify integrity against the SHA-256 checksum advertised
by Artifactory in the x-checksum-sha256 response header. Files
already on disk are skipped on subsequent cycles.
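
The concurrency bound can be pictured as a counting semaphore that each download must acquire before it starts. The patch does not say which runtime or semaphore type it uses, so the following is only a std-only sketch with OS threads: `Semaphore` and `run_bounded` are hypothetical names, and a short sleep stands in for streaming a file to disk.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built from std primitives (hypothetical
/// helper; an async codebase would more likely use its runtime's semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `jobs` simulated downloads with at most `limit` in flight and
/// returns the highest concurrency that was observed.
fn run_bounded(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (sem, in_flight, peak) = (sem.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(20)); // simulated transfer
                in_flight.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    let peak = run_bounded(6, 2);
    println!("peak concurrent downloads: {peak}");
}
```

Because the in-flight counter is only incremented while a permit is held, the observed peak can never exceed `limit`, whatever the thread scheduling.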

Artifact URL resolution lives in a new scenario/resolver.rs module
and is pure (no I/O), making it straightforward to unit test. JSONPath
evaluation handles sotpath expressions like
$.BoardSKUs[?@.Name == '...'].Components.Software[?@.Component == '...'].Locations[?@.Name == '...'].Location.
The SOT is fetched from NICC via list_rack_firmware and matched by
the Name field against the scenario's sot_release. A file-based
override on RvsCtx keeps the full pipeline exercisable without a
live NICC connection. Multi-SOT support (scenarios targeting different
releases in the same cycle) is left as a follow-up TODO.
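
To illustrate why a pure resolver is easy to unit test, here is a minimal sketch of the two steps described above: choosing the SOT by `Name` and turning an artifact source into a URL. All types and function names are hypothetical and are not the ones in `scenario/resolver.rs`. JSONPath evaluation is replaced by a caller-supplied closure, and treating anything other than exactly one match as an error is an assumption of this sketch.

```rust
/// How a scenario refers to an artifact (hypothetical shape).
enum ArtifactSource {
    /// A URL used as-is.
    Uri(String),
    /// A JSONPath expression to evaluate against the SOT document.
    SotPath(String),
}

/// One entry from the firmware listing (hypothetical fields).
struct SotEntry {
    name: String,
    json: String,
}

/// Picks the SOT whose name equals the scenario's `sot_release`.
fn select_sot<'a>(entries: &'a [SotEntry], sot_release: &str) -> Option<&'a SotEntry> {
    entries.iter().find(|e| e.name == sot_release)
}

/// Resolves a source to a URL. `eval` stands in for JSONPath evaluation, so
/// this function does no I/O and can be tested with a plain closure.
fn resolve<F>(source: &ArtifactSource, eval: F) -> Result<String, String>
where
    F: Fn(&str) -> Vec<String>,
{
    match source {
        ArtifactSource::Uri(uri) => Ok(uri.clone()),
        ArtifactSource::SotPath(expr) => {
            let mut hits = eval(expr);
            match hits.len() {
                1 => Ok(hits.remove(0)),
                0 => Err(format!("sotpath matched nothing: {expr}")),
                n => Err(format!("sotpath matched {n} values, expected 1: {expr}")),
            }
        }
    }
}

fn main() {
    let entries = vec![
        SotEntry { name: "1.2.1".into(), json: "{}".into() },
        SotEntry { name: "1.2.2".into(), json: "{}".into() },
    ];
    let sot = select_sot(&entries, "1.2.2").expect("release not found");
    println!("selected SOT {} ({} bytes of JSON)", sot.name, sot.json.len());

    // Fake evaluator: a real one would run the expression against `sot.json`.
    let eval = |_expr: &str| vec!["https://example.invalid/nvos.bin".to_string()];
    let source = ArtifactSource::SotPath("$.Example.Location".into());
    println!("{:?}", resolve(&source, eval));
}
```

Keeping the evaluator behind a closure means tests can cover the no-match and multi-match branches without building a full SOT document.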

The crate is restructured to have a lib target so the artifact,
scenario, and context types are shareable across binaries. A new
test-artifact-cache binary wires up the complete pipeline against a
local SOT file and serves the resulting cache, which is useful for
manual verification and for showing colleagues how the pieces fit together.

Testing

The test-artifact-cache binary was run against a real SOT JSON file
and a hand-crafted scenario TOML targeting release 1.2.2. The scenario
exercises three artifact kinds: an OS image placeholder, a direct-URI
artifact, and a sotpath-resolved artifact. A fourth large artifact
(~1.9 GB NVOS binary) was included to validate concurrent streaming
and checksum verification.

The SOT and scenario files are not committed (kept alongside the
upstream SOT JSON used for development). To reproduce, supply any SOT
JSON and a matching scenario TOML:

  target/debug/test-artifact-cache \
    --sot <path/to/sot.json> \
    --scenario <path/to/scenario.toml> \
    --cache-dir /tmp/rvs-test-cache \
    > /tmp/rvs-test.log 2>&1 &
  tail -f /tmp/rvs-test.log

After the downloads completed, the cache directory contained:

  total 3808848
  -rw-r--r--  1 user  root    31K  nmx-m-nmx-c.proto
  -rw-r--r--  1 user  root   1.8G  nvos.bin
  -rw-r--r--  1 user  root   1.7M  nvos_openapi.json
  -rw-r--r--  1 user  root   1.7M  os

All checksums passed. The server correctly served all files from
http://localhost:8080/gb200nvl/1.2.2/<filename>.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

#1653

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@maxo-nv maxo-nv requested a review from a team as a code owner May 19, 2026 16:53
@maxo-nv maxo-nv force-pushed the feature/rvs_sot_artifact_rebase0 branch from c3fc202 to c31ef62 Compare May 19, 2026 22:53
Signed-off-by: Max Olender <molender@nvidia.com>