fix: add NetworkPolicy workaround for nvsentinel metrics-access restriction #68

Merged: mchmarny merged 3 commits into NVIDIA:main from yuanchen8911:fix/nvsentinel-netpol-workaround on Feb 6, 2026

@yuanchen8911
Contributor

Summary

  • Add an additive allow-intra-namespace NetworkPolicy to neutralize the overly restrictive metrics-access NetworkPolicy created by NVSentinel (< v0.7.0)
  • The nvsentinel metrics-access policy restricts ingress to only ports 2112/9216, blocking traffic needed by other components in the namespace (e.g. DCGM port 5555, cert-manager webhook port 443)
  • Kubernetes NetworkPolicies are additive: a connection is allowed if any policy selecting the destination pod permits it. This policy allows all intra-namespace pod-to-pod ingress, which effectively neutralizes the restriction
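
The manifest itself is not shown in this conversation. As a rough sketch, assuming it follows the usual Kubernetes pattern for allowing same-namespace traffic, it might look like the following. Only the name allow-intra-namespace comes from this PR; everything else is an assumption, and the real file may set a namespace or use Helm templating.

```yaml
# Sketch only: not the actual contents of allow-intra-namespace.yaml.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace
spec:
  podSelector: {}          # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # any pod in the same namespace, on any port
```

Because the from clause has no namespaceSelector, it matches only pods in the policy's own namespace, so traffic from other namespaces is still governed by whatever other policies exist.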

Problem

When deployed in a shared namespace, the nvsentinel metrics-access NetworkPolicy blocks:

  • DCGM engine (port 5555): dcgm-exporter cannot connect → panic → CrashLoopBackOff
  • cert-manager webhook (port 443): startupapicheck fails with connection timeout
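
To illustrate why these ports are blocked: once any NetworkPolicy selects a pod for ingress, only explicitly allowed traffic reaches it. The NVSentinel policy is not reproduced in this PR, so the sketch below is a hypothetical reconstruction based only on the description above (ingress limited to ports 2112/9216), assuming it selects every pod in the namespace.

```yaml
# Hypothetical shape of the restrictive policy; not copied from the NVSentinel chart.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: metrics-access
spec:
  podSelector: {}        # assumption: selects all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - ports:             # only these ports are reachable; 5555 and 443 are not listed
        - protocol: TCP
          port: 2112
        - protocol: TCP
          port: 9216
```

With a policy of this shape in place, a connection to nvidia-dcgm on port 5555 matches no allow rule and is dropped, which would be consistent with the dcgm-exporter failure described above.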

Changes

  • New: pkg/recipe/data/components/nvsentinel/manifests/allow-intra-namespace.yaml
  • Modified: pkg/recipe/data/overlays/base.yaml — added manifestFiles reference to nvsentinel component

Upstream fix

NVIDIA/NVSentinel#789 (merged) adds a networkPolicy.enabled flag. This workaround should be removed once we pick up the nvsentinel chart version that includes it.
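
Once that chart version is available, the workaround could presumably be replaced by a values override along these lines. This is a sketch: the key path is inferred from the flag name networkPolicy.enabled, and both its exact location in the chart's values and its default are assumptions to check against the upstream chart.

```yaml
# Hypothetical Helm values override for the nvsentinel chart.
# Assumes the flag sits at the top level of the chart values.
networkPolicy:
  enabled: false   # assumption: false stops the chart from creating metrics-access
```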

Test plan

  • go test ./pkg/recipe/... ./pkg/bundler/... — all pass
  • eidos recipe + eidos bundle — generates bundle with allow-intra-namespace.yaml in templates/
  • helm template — renders valid NetworkPolicy
  • Deploy and verify dcgm-exporter connects to nvidia-dcgm:5555 successfully

🤖 Generated with Claude Code

…iction

NVSentinel (< v0.7.0) creates a "metrics-access" NetworkPolicy that
restricts ingress to only metrics ports (2112/9216), blocking traffic
needed by other components in the namespace — notably DCGM (port 5555)
and cert-manager webhook (port 443).

This adds an additive NetworkPolicy allowing all intra-namespace
pod-to-pod ingress to neutralize the restriction.

Upstream fix: NVIDIA/NVSentinel#789
Remove allow-intra-namespace.yaml once nvsentinel includes the
networkPolicy.enabled flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@yuanchen8911 requested a review from a team as a code owner on February 6, 2026, 01:49
github-actions bot commented Feb 6, 2026

Coverage Report ✅

Metric     Value
Coverage   73.2%
Threshold  70%
Status     Pass

Coverage unchanged by this PR.

@mchmarny merged commit 5232730 into NVIDIA:main on Feb 6, 2026 (15 checks passed)
@mchmarny deleted the fix/nvsentinel-netpol-workaround branch on February 6, 2026, 09:23
3 participants