Problem Statement
After OS-49 ships, our CI perf will have been tuned through a one-shot baseline (scripts/baseline_workflow_metrics.py — merged via #927) and phased migrations gated on decision tables. The script is disposable by design: re-run manually at phase cut-overs, not continuously.
Post-migration, we have no durable visibility of:
- Workflow wall/queue time drift over weeks and months
- Success-rate regressions on specific workflows (e.g., release-vm-kernel.yml sat at 14% success in the OS-125 baseline — would we notice the same happening elsewhere?)
- GHA cache hit-rate degradation as repo volume grows into the 10 GB quota
- Runner-pool health on nv-gha-runners (queue tail was the primary reason we migrated off ARC)
Without continuous measurement, we'll only hear about regressions when someone complains about a slow PR — exactly the state we're migrating away from.
Proposed Design
Use NVIDIA's canonical observability stack ("LGTM" via the Observability Service), not custom infra:
```
ci-metrics-collect.yml                  Observability Service              grafana.nvidia.com
┌────────────────────┐                ┌──────────────────────────┐        ┌───────────────────┐
│ weekly cron runs   │   OTLP-HTTP    │                          │        │ CI Perf dashboard │
│ baseline_workflow  │──(v1/metrics)─▶│      Mimir (metrics)     │◀───────│ + alerts          │
│ _metrics.py        │                │                          │        │                   │
└────────────────────┘                └──────────────────────────┘        └───────────────────┘
 GitHub Actions                       https://otlp-http.nvidia.com         HWInf Grafana
 Auth: NVAuth token header                                                 per-org DL
```
Steps to implement:
- Service Registry — register openshell-ci-metrics service at https://service-registry.nvidia.com/ (if not already present).
- NVAuth token — create a Service-to-Service token at https://nv-auth.nvidia.com/, target = observability-service. Store in repo secrets as NVAUTH_OBSERVABILITY_TOKEN.
- Data Portal dataset — create a Mimir dataset at https://data-portal.nvidia.com/ tied to a team DL.
- Grafana org — provision via Data Portal; wire Mimir as datasource.
- Workflow — new .github/workflows/ci-metrics-collect.yml on schedule: cron weekly + workflow_dispatch. Runs baseline_workflow_metrics.py, converts output to OTLP metrics, POSTs to https://otlp-http.nvidia.com/v1/metrics with Authorization header.
- Dashboard — panels keyed on PromQL queries against Mimir: gha_workflow_wall_p50_seconds{workflow=…}, rolling 30/90-day windows, success-rate heatmap.
- Alerting — Grafana alert rules on success_rate < 0.80 or wall_p50 > 2 × baseline, routed to #openshell Slack (or equivalent).
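The conversion-and-POST step of the workflow can be sketched roughly as below. This is a minimal sketch, not the implementation: the per-workflow field names (workflow, wall_p50, queue_p50, success_rate) are assumptions about what baseline_workflow_metrics.py emits, while the endpoint, secret name, and service name come from the steps above. The OTLP/HTTP JSON envelope follows the standard resourceMetrics → scopeMetrics → metrics gauge shape.

```python
# Sketch of the OTLP conversion step for ci-metrics-collect.yml.
# Assumed input: one record per workflow with workflow, wall_p50,
# queue_p50, success_rate fields (p95 counterparts follow the same pattern).
import json
import os
import time
import urllib.request

OTLP_ENDPOINT = "https://otlp-http.nvidia.com/v1/metrics"


def build_otlp_payload(rows, now_ns=None):
    """Convert per-workflow metric rows into one OTLP/HTTP JSON payload."""
    now_ns = now_ns or time.time_ns()

    def gauge(name, unit, value_for):
        # One gauge metric with a data point per workflow, labeled by name.
        return {
            "name": name,
            "unit": unit,
            "gauge": {"dataPoints": [
                {
                    "timeUnixNano": str(now_ns),
                    "asDouble": float(value_for(row)),
                    "attributes": [{"key": "workflow",
                                    "value": {"stringValue": row["workflow"]}}],
                }
                for row in rows
            ]},
        }

    return {"resourceMetrics": [{
        "resource": {"attributes": [{"key": "service.name",
                                     "value": {"stringValue": "openshell-ci-metrics"}}]},
        "scopeMetrics": [{"metrics": [
            gauge("gha_workflow_wall_p50_seconds", "s", lambda r: r["wall_p50"]),
            gauge("gha_workflow_queue_p50_seconds", "s", lambda r: r["queue_p50"]),
            gauge("gha_workflow_success_rate", "1", lambda r: r["success_rate"]),
        ]}],
    }]}


def post_metrics(payload):
    """POST the payload to the OTLP-HTTP ingest with the NVAuth token."""
    req = urllib.request.Request(
        OTLP_ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {os.environ['NVAUTH_OBSERVABILITY_TOKEN']}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Whether the token goes in a Bearer header or a custom header is for the Observability Onboarding Guide to settle; the payload builder is the part worth keeping either way.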
Alternatives Considered

| Approach | Trade-off | Verdict |
| --- | --- | --- |
| OTLP → LGTM via Observability Service (proposed) | NVIDIA-standard pattern per the canonical Observability Onboarding Guide. Mimir is pre-integrated with grafana.nvidia.com. Standard NVAuth path. | Preferred. |
| grafana-github-datasource plugin — query GitHub API live | Plugin not installed on NVIDIA Grafana; requires HWinf Grafana Plugin Installation Request. Also: GitHub rate limits, awkward percentile math in Grafana transforms. | Rejected. |
| Ship to NVDataFlow Elasticsearch instead of Mimir | NVDataFlow is a general data warehouse, not the canonical observability backend. Mimir is purpose-built for metrics + Prometheus-compatible querying. | Rejected for this use case. |
| Use ITMP Grafana instance (grafana.service.prd.itmp.nvidia.com) | Separate instance, different team, different access model. No reason to leave the HWinf instance. | Rejected. |
| Custom pipeline (cf. pimlock's cc to @TaylorMutch) | TBD — pending input from @TaylorMutch | TBD |
| GHE Actions Insights (native org feature) | Confirmed not enabled for OpenShell — 404 on https://github.com/NVIDIA/OpenShell/actions/metrics as of 2026-04-23. | Not available. |
| Status quo — manual re-runs of the script | Zero infra. Risks silent regressions. Fine for short-term; inadequate post-migration. | Insufficient. |
Agent Investigation
- scripts/baseline_workflow_metrics.py already produces the metric shape we want (wall p50/p95, queue p50/p95, success rate, run count). Output needs OTLP conversion, not re-computation.
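Since the script already emits these metrics, the proposed alert condition (success_rate < 0.80, wall_p50 > 2 × baseline) can be sanity-checked offline against its output before any Grafana rules exist. A minimal sketch, assuming per-workflow records with workflow, success_rate, and wall_p50 fields (the field names and the baseline value are assumptions, not verified against the script):

```python
# Hypothetical offline check mirroring the proposed Grafana alert rules.
# Field names and the baseline wall-time value are assumptions for
# illustration; real alerting runs in Grafana against Mimir.

def flag_regressions(rows, baselines, min_success=0.80, wall_factor=2.0):
    """Return (workflow, reasons) pairs for workflows breaching a threshold."""
    flagged = []
    for row in rows:
        reasons = []
        if row["success_rate"] < min_success:
            reasons.append(
                f"success_rate {row['success_rate']:.0%} < {min_success:.0%}")
        baseline = baselines.get(row["workflow"])
        if baseline is not None and row["wall_p50"] > wall_factor * baseline:
            reasons.append(
                f"wall_p50 {row['wall_p50']}s > {wall_factor}x baseline {baseline}s")
        if reasons:
            flagged.append((row["workflow"], reasons))
    return flagged
```

Applied to the OS-125 numbers, a workflow like release-vm-kernel.yml at 14% success would trip the success-rate branch immediately, which is exactly the regression class the dashboard is meant to surface.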
- Canonical NVIDIA observability stack confirmed from http://nv/observability (Observability Onboarding Guide):
  - Ingest: https://otlp-http.nvidia.com/v1/{metrics,logs,traces}
  - Backends: Mimir / Loki / Tempo (LGTM stack, managed by hwinf-ltm)
  - Visualization: grafana.nvidia.com (multi-tenant, per-org DLs)
  - Support: #hwinf-ltm-support Slack
- NVDataFlow (Elasticsearch) IS being kept as the long-term analytics store (transition back from OpenSearch is in progress), but it is not the metrics observability path — separate use case.
- grafana-github-datasource plugin is not installed on grafana.nvidia.com; would need a formal request via HWinf Grafana Plugin Installation Request. Unnecessary if we use Mimir.
- GHE Actions Insights (native) confirmed not exposed for OpenShell — both /actions/metrics and org-level equivalent return 404.
- Service Registry status for openshell-ci-metrics: unverified (SSO-only API). Likely does not exist yet.
Open Questions
- Does OpenShell already have a Service Registry entry we can reuse? (Needs SSO check.)
- Which DL gets Admin on the new Grafana org — openshell-maintainers or a new one?
- Is #openshell the right Slack target for alert routing, or should it go somewhere more ops-focused?
- Do @pimlock / @TaylorMutch have context on the prior custom pipeline that should inform design (label schema, metric naming conventions)?
Checklist