Add monitoring and health operations guide #1831

Draft
martinraumann wants to merge 8 commits into NVIDIA:main from martinraumann:forge-8255-monitoring-health-docs

@martinraumann
Contributor

Summary

Adds a Monitoring and Health page under Operations for NICo post-deployment workflows.

The guide covers:

  • hardware health monitoring through Redfish and hardware health services
  • DPU health checks and troubleshooting entry points
  • health alert lifecycle and health overrides
  • Prometheus ServiceMonitor configuration and metric groups
  • API health and availability checks
  • Grafana, Loki, and log lookup patterns

Validation

  • Checked the branch diff against main
  • Ran markdown scrub checks for legacy branding and internal gap-tracking language
  • Ran git diff --check

Notes

This page is intended to serve as the Operations Guide / Monitoring and Health page referenced by the documentation information architecture (IA).

@copy-pr-bot

copy-pr-bot Bot commented May 20, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.
