Releases: NVIDIA/ai-cloud-validation
M6 Release
M6'26 - Q2'26
attestation
- Issue #315: SEC22-01: Check that hardware passes nonce check
- Issue #316: SEC22-02: Check that TPM Configuration can be set per host
- Issue #319: ATTEST-XX-02: Check that an updated BIOS is installed on all hardware
Benchmarking
Capacity-Reservations
- Issue #252: CAP04-01: Verify a set of resources can be logically grouped and pinned to an account
- Issue #253: CAP04-02: Verify atomic allocation of a topology block as a single unit
Compute Services: BMaaS
- Issue #134: SEC21-02: Confirm sanitization on delete - Low Level
- Issue #136: Check the health and status of the DPU on instantiation
- Issue #312: SEC21-04: Verify memory sanitization between tenants
Compute Services: VMaaS
- Issue #313: SEC21-05: Verify SRAM/GPU memory is sanitized between tenants
- Issue #314: SEC21-06: Verify TPM and BIOS are reset during tenant transitions or hardware replacement
enhancement
Governance-Metrics
- Issue #251: CAP01-01: Query and verify governance API returns Delivered, Healthy, Reserved, and Active metrics for nodes/GPUs
hardware-ingestion
- Issue #133: Make sure that all hardware under test has been ingested and matches the provided hardware
image-registry
key-secret-mgmt
kubernetes-control-plane
- Issue #267: K8S-XX-08: Verify Cluster Autoscaler integration (upstream)
- Issue #277: K8S26-01: Support multiple clusters in the same tenancy and in the same VPC
network-security
- Issue #287: SDN04-04: Verify IB tenant isolation — compute dedicated to NVIDIA is isolated from other customers
- Issue #288: SDN04-05: Verify IB keys are configured: P_Key, Management Key, Aggregation Management Key, VendorSpecific Key, CongestionControl Key, Node2Node Key, Manager2Node Key
- Issue #310: SEC13-02: Verify insecure protocols (HTTP, SSLv3, TLSv1) are disabled
object-storage-service
sdn-controller
- Issue #284: SDN02-08: Measure and verify policy propagation timing is within acceptable bounds
- Issue #286: SDN02-10: Verify customizable port security policies can be applied to virtual interfaces
- Issue #289: SDN08-01: Verify that storage hosts can route to each other with all-to-all L3 communication
sds-controller
- Issue #263: DMS01-01: Verify a dedicated K8s cluster (or ability to create one) for the data mover stack
- Issue #264: DMS02-01: Verify dedicated CPU nodes are available for data mover with high-performance networking
- Issue #265: DMS03-01: Verify the same filesystem mounted on GPU nodes is accessible from data mover nodes
- Issue #266: DMS05-01: Verify stable egress IP for allowlisting access to NVIDIA services
Unified-Health-APIs
- Issue #254: CAP05-01: Verify per-host health API returns real-time GPU state, thermal status, memory health
- Issue #255: CAP05-02: Verify primitive-level health aggregation (cluster, nodegroup, or reservation level)
Misc
Total: 32 items (0 PRs, 32 issues)
M5 Release
M5'26 - Q2'26
attestation
audit-logging
- Issue #299: SEC08-01: Perform a management API call; verify an audit log entry is generated with correct metadata
- Issue #300: SEC08-02: Verify audit logs are retained for at least 30 days
authentication
- Issue #293: SEC01-01: Verify user authentication via OIDC for platform services
- Issue #294: SEC02-01: Verify workloads and nodes receive short-lived credentials/tokens
- Issue #295: SEC03-01: Verify out-of-cluster service accounts can authenticate with long-lived credentials
- Issue #298: SEC07-01: Verify all administrative interfaces (UI, CLI, API) are protected by Multi-Factor Authentication
authorization
- Issue #297: SEC04-02: Assign a minimal role to a user; verify they cannot perform actions outside that role on Compute, Storage, and Network APIs
backend-switch-fabric-api
- Issue #278: NET01-01: Query the API for a compute node and verify it returns backend switch IDs (leaf, spine, core)
block-storage-services
Compute Services: BMaaS
- Issue #250: BMAAS-XX-07: Check for any per-host status log over time.
- Issue #258: CNP06-02: Verify that serial console output is logged and queryable for at least 1 month of history
Compute Services: VMaaS
- Issue #256: CNP01-16: Verify console access is restricted via RBAC
- Issue #257: CNP01-17: Verify USB, clipboard, and unnecessary virtual devices are disabled
hardware-security-compliance
- Issue #303: SEC09-03: Verify a centralized KMS is used for all encryption keys and secrets
- Issue #304: SEC09-04: Verify support for Customer Managed Keys (BYOK)
kubernetes-control-plane
- Issue #269: K8S06-01: Create a K8s node pool via API/CLI specifying node type (CPU or GPU instance type)
- Issue #270: K8S06-02: Update a K8s node pool (e.g., scale to a target count)
- Issue #271: K8S15-01: Verify the K8s API endpoint has network access controls (firewall/private link)
- Issue #273: K8S23-04: Verify CSI supports block, shared filesystem, and NFS storage
- Issue #274: K8S23-05: Verify CSI supports both static and dynamic provisioning via PVs and PVCs
- Issue #275: K8S23-06: Verify CSI credentials are tenant cluster scoped (no cross-cluster access)
- Issue #276: K8S23-07: Verify APIs to query storage usage against overall cluster quota with per-PVC/Volume breakdown
- Issue #301: SEC09-01: Verify certificates are rotated on a 60-day cycle
- Issue #302: SEC09-02: Verify support for both provider-managed and customer-managed keys
network-security
- Issue #260: CNP10-01: Verify IPMI is disabled; Redfish over TLS is used with AAA
- Issue #305: SEC11-01: Verify hard physical or logical isolation between tenants for network, data, compute, and storage resources
- Issue #306: SEC12-01: Verify BMC management is on a dedicated, restricted network (physically separate or VLAN/VRF-isolated)
- Issue #307: SEC12-02: Verify BMC interfaces are not reachable from tenant networks
- Issue #308: SEC12-03: Verify BMC is only accessible via a hardened bastion (jumphost) server; direct public/corporate network access is blocked
- Issue #311: SEC14-01: Verify no public internet access to API endpoints by default
nvlink-domain-api
sdn-controller
- Issue #281: SDN02-05: Verify security group rules can be scoped at workload level
- Issue #282: SDN02-06: Verify security group rules can be scoped at node level
- Issue #283: SDN02-07: Verify security group rules can be scoped at subnet/tenant level
- Issue #285: SDN02-09: Verify security group rules can be scoped at K8s API service level
- Issue #290: SDN09-01: Verify logging is available for network hardware faults
- Issue #291: SDN09-02: Verify logging captures latency/performance fluctuations
- Issue #292: SDN09-03: Verify a detailed audit trail exists for all configuration changes to network filtering rules
sds-controller
- Issue #296: SEC04-01: Verify least-privilege access policies (resource-based, user-based, network-based)
Total: 40 items (0 PRs, 40 issues)
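As an illustration of how a check such as Issue #301 (SEC09-01, 60-day certificate rotation) might be expressed, here is a minimal sketch. It assumes that "rotated on a 60-day cycle" can be approximated by requiring that no certificate in use was issued more than 60 days ago; that reading, the function name `rotation_overdue`, and the constant `MAX_CERT_AGE` are all hypothetical and are not taken from the test suite.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a certificate older than 60 days counts as overdue for rotation.
MAX_CERT_AGE = timedelta(days=60)


def rotation_overdue(issued_at, now, max_age=MAX_CERT_AGE):
    """Return True if the certificate has been in use longer than max_age."""
    return now - issued_at > max_age


# Example with fixed timestamps so the result does not depend on the clock.
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
fresh = datetime(2026, 5, 1, tzinfo=timezone.utc)   # 31 days old
stale = datetime(2026, 3, 1, tzinfo=timezone.utc)   # 92 days old
print(rotation_overdue(fresh, now), rotation_overdue(stale, now))
```

A real validation would read the issue timestamp from each live certificate (its notBefore field) rather than from hard-coded values.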
M4 Release
M4'26 - Q2'26
Compute Services: BMaaS
- Issue #135: A running node can be power-cycled in case of issue
- Issue #181: Add tag validation to bare-metal test suite (CNP05)
Foundational Services
- Issue #131: Test case: Check that IP addresses are managed by the platform, and that DHCP is able to provide IP addresses
- Issue #132: Check that IP addresses are managed sensibly as VPCs are configured (Move to SDN section?)
sdn-controller
Total: 5 items (0 PRs, 5 issues)
M3 Release
M3 - Q2'26
Compute Services: BMaaS
- Issue #105: Topology-based placement
- Issue #106: A running node can be powered off (Without being destroyed)
- Issue #107: Support for cloud-init and instance metadata discovery via link-local addresses (e.g. 169.254.169.254) or virtual devices
- Issue #108: A powered off node can be powered back on
- Issue #109: Serial console access is required (read-only sufficient, interactive preferred)
Compute Services: VMaaS
- Issue #110: A running VM can be stopped (Without being destroyed)
- Issue #111: A powered off VM can be started
- Issue #112: Support for user-defined tags/labels and cloud-init metadata on instances.
- Issue #113: Support for cloud-init and instance metadata discovery via link-local addresses (e.g. 169.254.169.254) or virtual devices
- Issue #114: Serial console access is required (read-only sufficient, interactive preferred)
image-registry
- Issue #104: CRUD (get, list, create, delete) custom OS images for fast boot (raw, qcow2, img, etc) for BM/VM
observability
- Issue #120: Read only access to a BMaaS system's serial console
- Issue #121: Read only access to a VM serial console
sdn-controller
- Issue #115: Support non-conflicting Bring-Your-Own-IP (including 7.0.0.0/8)
- Issue #116: Support Stable Private IP allocations, where if a VM crashes and restarts the same IP address remains pinned until the node is deleted
- Issue #117: Atomically switch a floating private IP between nodes via API within <10 seconds without requiring an instance reboot.
- Issue #118: Localized DNS: Support for custom, localized DNS settings to enable internal domain resolution to private endpoints (e.g. storage endpoints)
- Issue #119: Support for VPC peering with full bandwidth and no "hairpin" routing.
Total: 18 items (0 PRs, 18 issues)
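The "non-conflicting" part of Issue #115 (Bring-Your-Own-IP) can be illustrated with a small overlap check. This is only a sketch under stated assumptions: the helper name `find_conflicts` and the idea of comparing a requested range against a list of already-allocated CIDRs are hypothetical, and the actual test may define conflict differently. It uses only Python's standard `ipaddress` module.

```python
import ipaddress


def find_conflicts(byoip_cidr, existing_cidrs):
    """Return the existing CIDRs that overlap the requested BYOIP range."""
    requested = ipaddress.ip_network(byoip_cidr)
    conflicts = []
    for cidr in existing_cidrs:
        net = ipaddress.ip_network(cidr)
        # overlaps() only compares networks of the same IP version.
        if net.version == requested.version and requested.overlaps(net):
            conflicts.append(cidr)
    return conflicts


# 7.0.0.0/8 is the range named in the issue title; the other ranges are made up.
print(find_conflicts("7.0.0.0/8", ["10.0.0.0/16", "192.168.0.0/24"]))
print(find_conflicts("7.0.0.0/8", ["7.1.0.0/16", "10.0.0.0/8"]))
```

An empty result means the requested range could be accepted without colliding with any range in the supplied list.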
M1 Release
M1 - Q1'26
Compute Services: BMaaS
- Issue #49: CRD: Create a Bare Metal node
- Issue #51: CRD: Get info about a specific Bare Metal node, including if it is ready for use
- Issue #52: CRD: Get a list of all Bare Metal nodes in a VPC
- Issue #53: CRD: Delete a Bare Metal node
- Issue #54: Confirm sanitization on delete - Tenant View
- Issue #56: A running node can be accessed via SSH/Teleport for further configuration
- Issue #57: A running node can be rebooted in case of issue
- Issue #58: A node can be reinstalled from its configured stock operating system
- Issue #60: Nodes can communicate across Ethernet, InfiniBand, and NVLink
- Issue #61: Verify NVIDIA hardware
- Issue #63: GPU Stress workload
- Issue #64: NIM inference jobs
- Issue #65: NCCL tests
- Issue #66: Training workload test
Compute Services: VMaaS
- Issue #45: CRD: Get a list of all VM nodes in a VPC
- Issue #69: A node can be reinstalled from its configured stock operating system
- Issue #82: NIM inference jobs
image-registry
- Issue #39: CRUD an OS install configuration
- Issue #41: Check that an OS image can be installed on a BMaaS system
- Issue #43: Check that an OS install configuration can be installed on a BMaaS system
Bug fixes
- fix: update NCCL image reference in settings and Kubernetes manifest by @abegnoche in #90
Full Changelog: v0.4.1...v0.4.2
M0 Release
Release Notes
Compute Services: VMaaS
- Issue #42: CRD: Create a VM node
- Issue #46: CRD: Delete a VM node
- Issue #47: Confirm sanitization on delete
- Issue #48: A running node can be accessed via SSH/Teleport for further configuration
- Issue #50: A running node can be rebooted in case of issue
- Issue #55: Check that the host OS is a known acceptable image (like Ubuntu/DGXOS)
- Issue #59: Check that the GPU is visible and accessible from the VM
- Issue #62: Check that the correct Linux Kernel, libvirt, sbios, and NVIDIA drivers are installed on host OS
- Issue #67: Check that vCPU pinning is set correctly, PCI bus is configured correctly on host OS
- Issue #68: GPU Stress workload
Workload-orchestration-K8s
- Issue #5: Provision K8s Cluster
- Issue #6: Test K8s/NVIDIA software/drivers
- Issue #7: Network validation
- Issue #8: NIM/TensorRT validation
- Issue #9: NIM/TensorRT validation at scale
- Issue #10: Refactor test framework
Workload-orchestration-Slurm
- Issue #22: Check Slurm ingestion
- Issue #24: Check job execution to ensure that all nodes can run Slurm jobs
- Issue #25: Run single node training job to ensure that all nodes are able to run a serious computation job using the GPUs
- Issue #36: Run multi-node training job to ensure that a serious computation job can be run across multiple nodes and benefit from NVLink/etc to accelerate that job
control-plane-accessible
- Issue #27: Make sure we can ping the control plane, or get a heartbeat/status code
- Issue #28: Access keys can be created, received, used to log in
- Issue #29: Access keys can expire or be disabled
- Issue #30: CRD: Create tenants
- Issue #31: CRD: Retrieve list of tenants
- Issue #32: CRD: Retrieve info about individual tenant
- Issue #33: CRD: Retrieve info about individual tenant
- Issue #34: CRD: Delete Tenant
- Issue #35: Add user to tenant
identity-asset-management
image-registry
- Issue #38: CRUD an OS image (iso file) (Upload an ISO file if supported, or remote URL if supported)
- Issue #40: CRUD a VM image
- Issue #44: Check that a VM image can be installed onto a VMaaS system
sdn-controller
- Issue #11: CRUD: Create a VPC
- Issue #12: CRUD: Retrieve a VPC
- Issue #13: CRUD: Update a VPC
- Issue #14: CRUD: Delete a VPC
- Issue #16: Verify that a VPC has an exclusive subnet
- Issue #17: Verify that nodes in two VPCs cannot communicate over the N/S (TAN) network
- Issue #18: Verify that nodes in two VPCs cannot communicate over the E/W (CIN) network
- Issue #19: NVLink Partitioning