Release v0.5.0
This release focuses on extensibility, production hardening, and operational flexibility. We've added support for custom drain handlers, PostgreSQL as an alternative database backend, comprehensive audit logging, and expanded our XID detection and remediation capabilities.
🎯 Major New Features
Custom Drain Extensibility
NVSentinel now supports custom drain handlers, allowing integration with specialized workload orchestrators. This feature enables organizations running HPC schedulers like Slinky, big data frameworks like Volcano, or ML platforms like Ray to integrate their custom drain logic seamlessly. The release includes a complete demo environment showcasing custom drain integration.
PostgreSQL Database Backend
Added PostgreSQL as an alternative to MongoDB, providing more flexibility in database selection. This addresses licensing concerns and operational preferences, and allows better alignment with existing infrastructure standards.
Note: PostgreSQL support is experimental and is not recommended for production clusters.
Audit Logging
Comprehensive audit logging for all NVSentinel write operations enables compliance reporting, security analysis, and operational troubleshooting. Every mutation is tracked with context about what changed, when, and by which component. The structured audit logs support configurable retention and rotation, with formats ready for integration with SIEM systems.
🔧 Enhanced Fault Detection & Remediation
XID 13 & XID 31 Workflow Implementation
Added automated workflows for handling XID 13 and XID 31, two critical GPU error conditions. These workflows help catch GPU degradation early.
XID 154 Support
Added support for detecting and handling XID 154 (GPU Recovery Action Changed) events.
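As a rough illustration of what detecting an XID event involves, the sketch below extracts the XID code from a kernel log line. It assumes the common NVIDIA driver log shape `NVRM: Xid (PCI:...): <code>, <detail>`. The sample line, the `parse_xid` helper, and the returned field names are hypothetical and do not reflect NVSentinel's actual implementation.

```python
import re
from typing import Optional

# Assumed log shape: "NVRM: Xid (PCI:0000:3b:00): 154, <detail>"
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),?\s*(?P<detail>.*)"
)


def parse_xid(line: str) -> Optional[dict]:
    """Return the PCI address, XID code, and detail text, or None if no XID is present."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "detail": match.group("detail").strip(),
    }


# Illustrative sample line, not captured from a real system.
sample = "kernel: NVRM: Xid (PCI:0000:3b:00): 154, GPU recovery action changed"
event = parse_xid(sample)
```

A monitor built along these lines could then branch on `event["xid"]` to choose a remediation workflow.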
Pre-Installed Driver Support
Enhanced support for environments where the driver is installed outside of the GPU Operator.
🏗️ Infrastructure & Architecture Improvements
ko-based Kubernetes Object Monitor
Migrated the Kubernetes object monitor to ko-based builds, resulting in faster build times for development iterations, smaller container images with reduced attack surface, and improved supply chain security with minimal base images.
Enhanced Build System
Version field is now properly passed from build args to Dockerfile for accurate version reporting, improving reproducibility and traceability in logs.
🐛 Bug Fixes & Reliability Improvements
Node Condition Message Limiting
Node condition messages are now automatically truncated to 1024 bytes to prevent Kubernetes API server issues with excessively large messages. This prevents edge cases where verbose error descriptions could cause API errors.
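The truncation described above can be sketched as follows. This is a minimal illustration, not NVSentinel's code: the `truncate_message` helper and the constant name are hypothetical, and the sketch assumes the limit applies to the UTF-8 encoded length of the message.

```python
MAX_CONDITION_MESSAGE_BYTES = 1024  # limit stated in the release notes


def truncate_message(message: str, limit: int = MAX_CONDITION_MESSAGE_BYTES) -> str:
    """Cut a message to at most `limit` UTF-8 bytes without splitting a character."""
    encoded = message.encode("utf-8")
    if len(encoded) <= limit:
        return message
    # errors="ignore" drops a partial multi-byte character left at the cut point.
    return encoded[:limit].decode("utf-8", errors="ignore")
```

Cutting on the encoded bytes rather than on characters matters because a message containing multi-byte characters can exceed 1024 bytes while holding fewer than 1024 characters.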
Quarantine Override Handling
Quarantine overrides are now properly applied to nodes that are already in a quarantined state, ensuring manual overrides work consistently regardless of node state.
Data Model Type Safety
Recommended action type changed from integer to string for better API clarity, type safety, and human readability in configurations.
Data Model Consistency
Replaced IGNORE with NONE throughout the data model for consistency with the canonical data schema.
Log Collector Concurrency
Improved handling of must-gather toggle and concurrent log collector job scenarios to prevent resource conflicts and ensure reliable log collection.
🧪 Testing & Quality Improvements
Enhanced Tilt Testing
Added comprehensive tilt tests for the CSP health monitor. The tests behave deterministically and do not rely on sleep-based timing, which makes them faster and more reliable for developers.
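The general idea of replacing sleep-based timing with explicit signalling can be sketched as below. This is a generic illustration under stated assumptions, not the project's test code: `run_check` is a hypothetical stand-in for an asynchronous component, and the actual tilt tests may use a different mechanism.

```python
import threading
from typing import Callable


def run_check(on_done: Callable[[], None]) -> threading.Thread:
    """Hypothetical asynchronous component that calls on_done when it finishes."""
    worker = threading.Thread(target=on_done)
    worker.start()
    return worker


def check_completes() -> bool:
    done = threading.Event()
    worker = run_check(done.set)
    # Wait for the completion signal instead of sleeping for a fixed interval.
    # The timeout only bounds how long a failing test can hang.
    completed = done.wait(timeout=5.0)
    worker.join()
    return completed
```

Because the test resumes as soon as the signal arrives, it finishes quickly on success and does not depend on guessing how long the work takes.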
Scale Testing Framework
New performance and scale tests to validate NVSentinel behavior under load:
- FQM Latency & Queue Depth: Tests for fault quarantine module performance characteristics
- API Server & MongoDB Performance: Validation of data layer performance at scale
Log Collector Tilt Tests
Added automated tilt tests for the log collector module, improving test coverage for critical troubleshooting workflows.
📚 Documentation Improvements
Operational Documentation
- Datastore architecture and migration documentation
- Comprehensive configuration reference
- Feature documentation and user guides
- Runbooks for common operational scenarios
- Upgrade procedures and best practices
- IAM setup guide for CSP health monitor
- Documentation for pre-installed GPU driver support
🔄 Dependencies & Maintenance
Security Updates
- Upgraded Go modules to address CVEs in dependencies
- Bumped various dependencies to latest stable versions
CI/CD Improvements
- Added dependabot configuration for GPU API
- Enhanced GitHub Actions workflows
- Improved contributor automation with copy-pr-bot updates
🙏 Acknowledgments
This release includes contributions from an amazing team across NVIDIA and the community:
- @rupalis-nv
- @XRFXLP
- @tanishagoyal2
- @dims
- @lalitadithya
- @KaivalyaMDabhadkar
- @deesharma24
- @nitz2407
- @ksaur
- @pteranodan
- @natherz97
- @jtschelling
- @ArangoGutierrez
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
📦 What's Included
Container Images (15 components)
- gpu-health-monitor-dcgm3/gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
- event-exporter
- kubernetes-object-monitor
All images include the latest bug fixes, security updates, and feature enhancements from this release.
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See the /docs directory in the repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- Custom drain handlers require implementing the drain handler interface
- PostgreSQL backend is in preview and should be thoroughly tested before production use
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v0.4.x:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v0.5.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.