Release v0.5.0

github-actions released this 08 Dec 13:32 · tag v0.5.0 · commit 6cce9d6

This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.

🎯 Major New Features

Custom Drain Extensibility

NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. Organizations running HPC schedulers like Slinky, batch schedulers like Volcano, or ML platforms like Ray can plug in their own drain logic. The release includes a complete demo environment showcasing custom drain integration.

PostgreSQL Database Backend

Added PostgreSQL as an alternative database backend to MongoDB, giving operators more flexibility in database selection. This addresses licensing concerns, accommodates operational preferences, and allows better alignment with existing infrastructure standards.

Note: PostgreSQL support is experimental and is not recommended for production clusters.

Audit Logging

Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.

🔧 Enhanced Fault Detection & Remediation

XID 13 & XID 31 Workflow Implementation

Added automated workflows for handling the critical GPU error conditions reported as XID 13 (Graphics Engine Exception) and XID 31 (GPU memory page fault). These workflows help catch GPU degradation early.

XID 154 Support

Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.

Pre-Installed Driver Support

Enhanced support for environments where the GPU driver is installed outside of the GPU Operator.

🏗️ Infrastructure & Architecture Improvements

ko-based Kubernetes Object Monitor

Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.

Enhanced Build System

The version field is now passed correctly from build arguments to the Dockerfile, giving accurate version reporting and improving reproducibility and traceability in logs.

🐛 Bug Fixes & Reliability Improvements

Node Condition Message Limiting

Node condition messages are now automatically truncated to 1024 bytes to prevent Kubernetes API server issues with excessively large messages. This prevents edge cases where verbose error descriptions could cause API errors.

Quarantine Override Handling

Quarantine overrides are now properly applied to nodes that are already in quarantined state, ensuring manual overrides work consistently regardless of node state.

Data Model Type Safety

Recommended action type changed from integer to string for better API clarity, type safety, and human readability in configurations.

Data Model Consistency

Replaced IGNORE with NONE throughout the data model for consistency with the canonical data schema.

Log Collector Concurrency

Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.

🧪 Testing & Quality Improvements

Enhanced Tilt Testing

Added comprehensive tilt tests for the CSP health monitor. The tests behave deterministically and avoid sleep-based timing, which makes them faster and more reliable for developers.

Scale Testing Framework

New performance and scale tests to validate NVSentinel behavior under load:

  • FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
  • API Server & MongoDB Performance: Validation of data layer performance at scale

Log Collector Tilt Tests

Added automated tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.

📚 Documentation Improvements

Operational Documentation

  • Datastore architecture and migration documentation
  • Comprehensive configuration reference
  • Feature documentation and user guides
  • Runbooks for common operational scenarios
  • Upgrade procedures and best practices
  • IAM setup guide for CSP health monitor
  • Documentation for pre-installed GPU driver support

🔄 Dependencies & Maintenance

Security Updates

  • Upgraded Go modules to address CVEs in dependencies
  • Bumped various dependencies to latest stable versions

CI/CD Improvements

  • Added dependabot configuration for GPU API
  • Enhanced GitHub Actions workflows
  • Improved contributor automation with copy-pr-bot updates

🙏 Acknowledgments

This release includes contributions from an amazing team across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

📦 What's Included

Container Images (15 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup
  • event-exporter
  • kubernetes-object-monitor

All images include the latest bug fixes, security updates, and feature enhancements from this release.

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • Custom drain handlers require implementing the drain handler interface
  • PostgreSQL backend is in preview and should be thoroughly tested before production use

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.4.x:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.5.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.