Release v0.4.0

github-actions released this 24 Nov 12:02
· 433 commits to main since this release
v0.4.0
7209959

This release brings major enhancements to NVSentinel's observability, testing infrastructure, and operational flexibility. We've added new monitoring capabilities, expanded the supported database options, and invested in automated testing on real cloud infrastructure to improve reliability at scale.

🎯 Major New Features

Health Event Exporter

NVSentinel now includes a dedicated event exporter that enables seamless integration with external monitoring and analytics systems. Export health events to your preferred data platform for long-term analysis, compliance reporting, or integration with existing observability stacks.

Kubernetes Object Health Monitor

A new monitor tracks Kubernetes objects to provide insight into the health of nodes and accelerators. This is particularly useful for monitoring node conditions set by entities that aren't yet integrated with NVSentinel, allowing you to leverage existing health signals from other monitoring tools and operators running in your cluster.

Repeated XID Pattern Detection

The health events analyzer can now identify unique XIDs within burst windows and correlate them across multiple bursts to detect repeated XID patterns. This advanced pattern matching helps identify nodes with recurring but intermittent issues, enabling proactive intervention before these patterns lead to major failures.
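
As a rough illustration of the idea (not the analyzer's actual implementation), the sketch below groups XID events into bursts by time gap, reduces each burst to its set of unique XIDs, and reports the XIDs that show up in more than one burst. The function names, the `max_gap_s` window, and the `min_bursts` threshold are all assumptions made for this example.

```python
from collections import Counter


def split_into_bursts(events, max_gap_s):
    """Group (timestamp_seconds, xid) pairs into bursts of unique XIDs.

    A new burst starts whenever the gap to the previous event exceeds
    max_gap_s. Both the gap rule and the parameter are hypothetical.
    """
    bursts = []
    last_ts = None
    for ts, xid in sorted(events):
        if last_ts is None or ts - last_ts > max_gap_s:
            bursts.append(set())  # start a new burst
        bursts[-1].add(xid)       # a set keeps each XID once per burst
        last_ts = ts
    return bursts


def repeated_xids(events, max_gap_s=60.0, min_bursts=2):
    """Return the XIDs that appear in at least min_bursts separate bursts."""
    bursts = split_into_bursts(events, max_gap_s)
    counts = Counter(xid for burst in bursts for xid in burst)
    return {xid for xid, n in counts.items() if n >= min_bursts}
```

With events at 0 s, 5 s, and 10 s forming one burst and later events forming others, an XID that fires many times inside a single burst counts once, while an XID that returns in a later burst is flagged as a repeated pattern.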

Enhanced Database Flexibility

You can now choose between Bitnami MongoDB and Percona MongoDB based on your organizational preferences and requirements. This flexibility allows better alignment with existing infrastructure standards and support agreements.

Local Development & Testing with KIND

We've added a complete local error injection demo that runs on KIND (Kubernetes IN Docker) clusters. This makes it easy to test NVSentinel's behavior, experiment with configurations, and validate custom integrations without requiring access to GPU hardware or cloud resources.

Unified MongoDB SDK

All MongoDB operations have been consolidated into a unified store-client SDK, providing consistent data access patterns across all modules. This refactoring improves code maintainability, reduces duplication, and makes it easier to extend NVSentinel's data layer.

🔧 Configuration & Usability Improvements

Component-Specific Tolerations

Platform connectors now support component-specific tolerations, giving you fine-grained control over which nodes the connector instances can run on. This is particularly useful in heterogeneous clusters with different taint configurations.

🐛 Bug Fixes & Reliability Improvements

  • Fixed: Added a nil pointer check to prevent a panic during graceful shutdown
  • Fixed: TypeError in GPU Health Monitor signal handler that could cause unexpected terminations
  • Fixed: Duplicate node-drainer events eliminated by ensuring consistent pod list ordering
  • Fixed: Partial recovery healthy events are no longer incorrectly propagated to node drainer and fault remediation modules
  • Fixed: CSP monitor reliability improvements for better cloud provider integration
  • Fixed: ECR registry used for base images to avoid Docker Hub rate limiting
  • Fixed: SAFE_REF used in Helm publish workflow to handle special characters in branch names
  • Added: Pre-upgrade Helm hook automatically cleans up deprecated node conditions during upgrades
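
To show why pod list ordering matters for the duplicate node-drainer events fix above, here is a minimal sketch. It assumes, purely for illustration, that events are deduplicated by a key derived from the node name and its pod list; `drain_event_key` and the key format are hypothetical and not taken from the NVSentinel code base.

```python
import hashlib


def drain_event_key(node, pods):
    """Build a dedupe key from a node name and (namespace, name) pod tuples.

    Sorting first means the same set of pods always yields the same key,
    regardless of the order in which the pods were listed.
    """
    ordered = sorted(pods)
    payload = node + "|" + ",".join(f"{ns}/{name}" for ns, name in ordered)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Without the `sorted` call, two listings of the same pods in a different order would produce different keys and could be treated as two distinct events.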

🧪 Testing & Quality Improvements

Automated User Acceptance Testing (UAT)

  • AWS UAT: Automated end-to-end tests running on actual AWS infrastructure with GPU instances
  • GCP UAT: Comprehensive UAT coverage on Google Cloud Platform

Development Environment

  • Fixed: Setup issues in the Linux development environment

Test Configuration

  • Updated test configurations to use more appropriate time windows, reducing test flakiness while maintaining coverage

🏗️ Infrastructure & Development

Dependency Management

  • Multiple dependency updates merged from Dependabot across AWS SDK, configuration libraries, and other critical dependencies
  • Helm version pinned to v3.19.2 to ensure consistent behavior across environments
  • Upgraded various Go and Python packages to latest stable versions

CI/CD Improvements

  • Removed paths-ignore in GitHub Actions to improve integration with copy-pr-bot
  • Enhanced workflow reliability and error handling
  • Better handling of branch names and special characters in automation
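
The release notes do not describe how the workflow derives its sanitized ref, so the following is only a sketch of the general technique: replace every run of characters outside a conservative allow-list with a single hyphen. The `safe_ref` function, the allowed character set, and the `"unknown"` fallback are assumptions for this example.

```python
import re


def safe_ref(ref: str) -> str:
    """Return a version of a branch or tag name that is safe to embed
    in artifact names. Hypothetical helper, not the workflow's own logic.
    """
    # Collapse each run of disallowed characters into one hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", ref)
    # Trim separators left at the edges; fall back if nothing remains.
    return cleaned.strip("-.") or "unknown"
```

For example, `safe_ref("feature/add-exporter#2")` yields `feature-add-exporter-2`.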

📚 Documentation

Updated Documentation

  • Comprehensive log collection documentation with detailed troubleshooting guides
  • Updated guides to reflect current best practices

🙏 Acknowledgments

This release includes contributions from across NVIDIA and the community.

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

📦 What's Included

Container Images (15 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup
  • event-exporter (NEW)
  • kubernetes-object-monitor (NEW)

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production environments
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain complex failure scenarios
  • The Kubernetes object monitor is in preview and may require tuning for specific workloads

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v0.3.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.4.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the README and documentation in the repository.