Release v0.4.0
This release brings major enhancements to NVSentinel's observability, testing infrastructure, and operational flexibility. We've added powerful new monitoring capabilities, improved database options, and made significant investments in automated testing to ensure reliability at scale.
🎯 Major New Features
Health Event Exporter
NVSentinel now includes a dedicated event exporter that enables seamless integration with external monitoring and analytics systems. Export health events to your preferred data platform for long-term analysis, compliance reporting, or integration with existing observability stacks.
Kubernetes Object Health Monitor
A new monitor tracks Kubernetes objects to provide insight into the health of nodes and accelerators. It is particularly useful for monitoring node conditions set by entities that aren't yet integrated with NVSentinel, so you can reuse existing health signals from other monitoring tools and operators running in your cluster.
Repeated XID Pattern Detection
The health events analyzer can now identify unique XIDs within burst windows and correlate them across multiple bursts to detect repeated XID patterns. This advanced pattern matching helps identify nodes with recurring but intermittent issues, enabling proactive intervention before these patterns lead to major failures.
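The idea of correlating unique XIDs across bursts can be illustrated with a small sketch. This is not the analyzer's actual implementation: the function names, the gap-based definition of a burst, and the `max_gap_s` and `min_bursts` parameters are all assumptions made for illustration.

```python
from collections import Counter

def split_into_bursts(events, max_gap_s):
    """Group (timestamp_s, xid) pairs into bursts.

    Assumption: a new burst starts whenever more than max_gap_s seconds
    pass without any XID event. Each burst is reduced to its set of
    unique XIDs, so repeats inside one burst count only once.
    """
    bursts = []
    last_ts = None
    for ts, xid in sorted(events):
        if last_ts is None or ts - last_ts > max_gap_s:
            bursts.append(set())
        bursts[-1].add(xid)
        last_ts = ts
    return bursts

def repeated_xids(events, max_gap_s=60.0, min_bursts=2):
    """Return XIDs that show up in at least min_bursts separate bursts."""
    bursts = split_into_bursts(events, max_gap_s)
    counts = Counter(xid for burst in bursts for xid in burst)
    return {xid for xid, n in counts.items() if n >= min_bursts}

# Hypothetical events for a single node: (seconds since start, XID code).
events = [(0, 79), (5, 79), (10, 48), (3600, 79), (3605, 31), (7200, 79)]
print(repeated_xids(events, max_gap_s=60, min_bursts=3))  # {79}
```

In this example XID 79 appears in all three bursts while 48 and 31 each appear in only one, so only 79 is flagged as a recurring pattern.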
Enhanced Database Flexibility
You can now choose between Bitnami MongoDB and Percona MongoDB based on your organizational preferences and requirements. This flexibility allows better alignment with existing infrastructure standards and support agreements.
Local Development & Testing with KIND
We've added a complete local error injection demo that runs on KIND (Kubernetes IN Docker) clusters. This makes it easy to test NVSentinel's behavior, experiment with configurations, and validate custom integrations without requiring access to GPU hardware or cloud resources.
Unified MongoDB SDK
All MongoDB operations have been consolidated into a unified store-client SDK, providing consistent data access patterns across all modules. This refactoring improves code maintainability, reduces duplication, and makes it easier to extend NVSentinel's data layer.
🔧 Configuration & Usability Improvements
Component-Specific Tolerations
Platform connectors now support component-specific tolerations, giving you fine-grained control over which nodes the connector instances can run on. This is particularly useful in heterogeneous clusters with different taint configurations.
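For readers less familiar with how tolerations decide where a pod may run, the sketch below models the matching rules described in the Kubernetes documentation. It is a simplified illustration, not NVSentinel or scheduler code; the taint key and values in the example are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint.

    Follows the documented Kubernetes rules: an empty effect matches any
    effect, operator defaults to "Equal", and "Exists" with an empty key
    matches every taint.
    """
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")

def can_schedule(tolerations, taints):
    """A pod can be placed on a node only if every hard taint is tolerated.

    PreferNoSchedule is a soft preference, so it is ignored here.
    """
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)

# Hypothetical taint on a GPU node pool.
gpu_taints = [{"key": "example.com/gpu", "value": "true", "effect": "NoSchedule"}]
connector_tolerations = [{"key": "example.com/gpu", "operator": "Exists"}]

print(can_schedule(connector_tolerations, gpu_taints))  # True
print(can_schedule([], gpu_taints))                     # False
```

Giving each component its own toleration list means a connector can be allowed onto tainted nodes without granting the same access to every other component.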
🐛 Bug Fixes & Reliability Improvements
- Fixed: Added a nil pointer check to prevent a panic during graceful shutdown
- Fixed: TypeError in GPU Health Monitor signal handler that could cause unexpected terminations
- Fixed: Duplicate node-drainer events eliminated by ensuring consistent pod list ordering
- Fixed: Partial recovery healthy events are no longer incorrectly propagated to node drainer and fault remediation modules
- Fixed: CSP monitor reliability improvements for better cloud provider integration
- Fixed: ECR registry used for base images to avoid Docker Hub rate limiting
- Fixed: SAFE_REF used in Helm publish workflow to handle special characters in branch names
- Added: Pre-upgrade Helm hook automatically cleans up deprecated node conditions during upgrades
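To see why pod list ordering can matter for the duplicate node-drainer events mentioned in the list above, consider the following sketch. The release notes do not describe the actual mechanism, so this rests on an assumption: that events are deduplicated by a key derived from the node and its pod list. The `drain_event_key` function is hypothetical.

```python
import hashlib

def drain_event_key(node, pod_names):
    """Build a deduplication key from a node name and its pods.

    Sorting first makes the key independent of the order in which the
    API happened to return the pods.
    """
    payload = "\n".join([node, *sorted(pod_names)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

seen = set()
emitted = []
# The same pods reported in two different orders (hypothetical names).
for pods in (["trainer-1", "trainer-0"], ["trainer-0", "trainer-1"]):
    key = drain_event_key("gpu-node-a", pods)
    if key not in seen:
        seen.add(key)
        emitted.append(pods)

print(len(emitted))  # 1
```

Without the `sorted` call the two listings would hash to different keys and the second one would be emitted as a separate event.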
🧪 Testing & Quality Improvements
Automated User Acceptance Testing (UAT)
- AWS UAT: Automated end-to-end tests running on actual AWS infrastructure with GPU instances
- GCP UAT: Comprehensive UAT coverage on Google Cloud Platform
Development Environment
- Fixed: Linux development environment setup issues
Test Configuration
- Updated test configurations to use more appropriate time windows, reducing test flakiness while maintaining coverage
🏗️ Infrastructure & Development
Dependency Management
- Multiple dependency updates merged from Dependabot across AWS SDK, configuration libraries, and other critical dependencies
- Helm version pinned to v3.19.2 to ensure consistent behavior across environments
- Upgraded various Go and Python packages to latest stable versions
CI/CD Improvements
- Removed paths-ignore in GitHub Actions to improve integration with copy-pr-bot
- Enhanced workflow reliability and error handling
- Better handling of branch names and special characters in automation
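The release notes do not show how the workflow derives SAFE_REF, so the following is only a sketch of one common approach to making a branch name safe for use as a chart or image tag. The `safe_ref` function and its fallback value are assumptions; the character set and 128-character limit follow the usual container tag rules.

```python
import re

def safe_ref(ref, max_len=128):
    """Turn a Git ref into a string that is safe to use as a tag.

    Runs of characters outside [A-Za-z0-9._-] become a single dash,
    and leading or trailing dashes and dots are trimmed because tags
    may not start with either.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", ref).strip("-.")
    return cleaned[:max_len] or "unnamed"

print(safe_ref("feature/gpu#42 fix"))  # feature-gpu-42-fix
print(safe_ref("user@team/hot-fix"))   # user-team-hot-fix
```

Computing the sanitized value once and reusing it everywhere in the workflow avoids the case where one step quotes the raw branch name correctly and another does not.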
📚 Documentation
Updated Documentation
- Comprehensive log collection documentation with detailed troubleshooting guides
- Updated guides to reflect current best practices
🙏 Acknowledgments
This release includes contributions from across NVIDIA and the community:
- @lalitadithya
- @XRFXLP
- @ksaur
- @KaivalyaMDabhadkar
- @Gyan172004
- @mchmarny
- @dims
- @tanishagoyal2
- @rupalis-nv
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
📦 What's Included
Container Images (15 components)
- gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
- event-exporter (NEW)
- kubernetes-object-monitor (NEW)
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production environments
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain complex failure scenarios
- The Kubernetes object monitor is in preview and may require tuning for specific workloads
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --create-namespace
To upgrade from v0.3.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --reuse-values
For detailed installation and configuration instructions, see the README and documentation in the repository.