docs: document multi-user metrics isolation (PR #1523)#1656

Merged: lbliii merged 5 commits into 26.04-staging from lbliii/doc-pr-1523 on Apr 2, 2026

lbliii (Contributor) commented Mar 24, 2026

Description

Documents the Prometheus/Grafana multi-user metrics isolation feature from #1523 in the v26.04 Fern docs. This PR replaces the 26.02 release notes with a 26.04 skeleton that contains the metrics isolation entry. It also expands the monitoring setup section in the memory management guide with step-by-step instructions for start_prometheus_grafana.py, RayClient configuration, SLURM considerations, and multi-user cluster usage.

Usage

from nemo_curator.core.client import RayClient

ray_client = RayClient(
    include_dashboard=True,
    metrics_dir="/shared/metrics/user_a"
)
ray_client.start()
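
On a shared cluster, each user needs a distinct metrics_dir. The following is a minimal sketch of one way to derive it, assuming a shared root such as /shared/metrics that every user can write to. The helper name per_user_metrics_dir is hypothetical and is not part of nemo_curator; only the metrics_dir parameter comes from this PR.

```python
import getpass
from pathlib import Path
from typing import Optional


def per_user_metrics_dir(shared_root: str, user: Optional[str] = None) -> Path:
    """Return <shared_root>/<user>, creating the directory if it is missing.

    Hypothetical helper for illustration; not part of nemo_curator.
    """
    # Fall back to the login name of the current process owner.
    user = user or getpass.getuser()
    path = Path(shared_root) / user
    # parents=True creates the shared root too; exist_ok makes reruns safe.
    path.mkdir(parents=True, exist_ok=True)
    return path


# Assumed usage, mirroring the snippet above (requires nemo_curator):
#
#   from nemo_curator.core.client import RayClient
#   metrics_dir = per_user_metrics_dir("/shared/metrics")
#   ray_client = RayClient(include_dashboard=True, metrics_dir=str(metrics_dir))
#   ray_client.start()
```

Deriving the directory from the login name keeps two users on the same node from writing into the same metrics location without either of them having to pick a path by hand.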

Checklist

  • I am familiar with the Contributing Guide.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

lbliii and others added 4 commits March 24, 2026 14:39
Document the multi-user metrics isolation feature (per-user metrics
directories, metrics_dir parameter, PID-file tracking, auto-generated
Ray dashboards, graceful cleanup). Expand the monitoring setup section
in memory-management.mdx with step-by-step instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Add periods to complete-sentence list items in release notes.
Fix passive voice ("are tracked" → active, "is stored" → active).
Adjust phrasing for PACE voice consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

The Ray dashboard generator uses "default" not "core" as the name
(generate_default_grafana_dashboard → ray_default_dashboard.json).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Move Prometheus/Grafana content from memory-management.mdx into a new
monitoring.mdx page under reference/infrastructure. Update nav, release
notes link, and best practices cross-references. Rename step headers
from "Step N:" to "N." for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Combine multi-user metrics isolation entry from PR #1523 with
Cosmos-Xenna 0.2.0, Workflow Results API, bug fixes, and breaking
changes added to 26.04-staging.

Signed-off-by: Logan Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

2 participants