Problem Statement
After OS-49 ships, our CI perf will have been tuned through a one-shot baseline (scripts/baseline_workflow_metrics.py — merged via #927) and phased migrations gated on decision tables. The script is disposable by design: re-run manually at phase cut-overs, not continuously.
Post-migration, we have no durable visibility of:
- Workflow wall/queue time drift over weeks and months
- Success-rate regressions on specific workflows (e.g., release-vm-kernel.yml sat at 14% success in the OS-125 baseline — would we notice the same happening elsewhere?)
- GHA cache hit-rate degradation as repo volume grows into the 10 GB quota
- Runner-pool health on nv-gha-runners (queue tail was the primary reason we migrated off ARC)
Without continuous measurement, we'll only hear about regressions when someone complains about a slow PR — exactly the state we're migrating away from.
Proposed Design
Use NVIDIA's canonical observability stack ("LGTM" via the Observability Service), not custom infra:
```
ci-metrics-collect.yml                  Observability Service              grafana.nvidia.com
┌────────────────────┐                ┌──────────────────────────┐        ┌───────────────────┐
│ weekly cron runs   │   OTLP-HTTP    │                          │        │ CI Perf dashboard │
│ baseline_workflow  │──(v1/metrics)─▶│      Mimir (metrics)     │◀───────│ + alerts          │
│ _metrics.py        │                │                          │        │                   │
└────────────────────┘                └──────────────────────────┘        └───────────────────┘
 GitHub Actions                       https://otlp-http.nvidia.com         HWInf Grafana
 Auth: NVAuth token header                                                 per-org DL
```
Steps to implement:
- Service Registry — register openshell-ci-metrics service at https://service-registry.nvidia.com/ (if not already present).
- NVAuth token — create a Service-to-Service token at https://nv-auth.nvidia.com/, target = observability-service. Store in repo secrets as NVAUTH_OBSERVABILITY_TOKEN.
- Data Portal dataset — create a Mimir dataset at https://data-portal.nvidia.com/ tied to a team DL.
- Grafana org — provision via Data Portal; wire Mimir as datasource.
- Workflow — new .github/workflows/ci-metrics-collect.yml on schedule: cron weekly + workflow_dispatch. Runs baseline_workflow_metrics.py, converts output to OTLP metrics, POSTs to https://otlp-http.nvidia.com/v1/metrics with Authorization header.
- Dashboard — panels keyed on PromQL queries against Mimir: gha_workflow_wall_p50_seconds{workflow=…}, rolling 30/90-day windows, success-rate heatmap.
- Alerting — Grafana alert rules on success_rate < 0.80 or wall_p50 > 2 × baseline, routed to #openshell Slack (or equivalent).
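The conversion-and-POST step of the workflow can be sketched roughly as below. This is a minimal sketch, not the implementation: the per-workflow field names (workflow, wall_p50, queue_p50, success_rate) are assumptions about what baseline_workflow_metrics.py emits, while the endpoint, secret name, and service name come from the steps above. The OTLP/HTTP JSON envelope follows the standard resourceMetrics → scopeMetrics → metrics gauge shape.

```python
# Sketch of the OTLP conversion step for ci-metrics-collect.yml.
# Assumed input: one record per workflow with workflow, wall_p50,
# queue_p50, success_rate fields (p95 counterparts follow the same pattern).
import json
import os
import time
import urllib.request

OTLP_ENDPOINT = "https://otlp-http.nvidia.com/v1/metrics"


def build_otlp_payload(rows, now_ns=None):
    """Convert per-workflow metric rows into one OTLP/HTTP JSON payload."""
    now_ns = now_ns or time.time_ns()

    def gauge(name, unit, value_for):
        # One gauge metric with a data point per workflow, labeled by name.
        return {
            "name": name,
            "unit": unit,
            "gauge": {"dataPoints": [
                {
                    "timeUnixNano": str(now_ns),
                    "asDouble": float(value_for(row)),
                    "attributes": [{"key": "workflow",
                                    "value": {"stringValue": row["workflow"]}}],
                }
                for row in rows
            ]},
        }

    return {"resourceMetrics": [{
        "resource": {"attributes": [{"key": "service.name",
                                     "value": {"stringValue": "openshell-ci-metrics"}}]},
        "scopeMetrics": [{"metrics": [
            gauge("gha_workflow_wall_p50_seconds", "s", lambda r: r["wall_p50"]),
            gauge("gha_workflow_queue_p50_seconds", "s", lambda r: r["queue_p50"]),
            gauge("gha_workflow_success_rate", "1", lambda r: r["success_rate"]),
        ]}],
    }]}


def post_metrics(payload):
    """POST the payload to the OTLP-HTTP ingest with the NVAuth token."""
    req = urllib.request.Request(
        OTLP_ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {os.environ['NVAUTH_OBSERVABILITY_TOKEN']}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Whether the token goes in a Bearer header or a custom header is for the Observability Onboarding Guide to settle; the payload builder is the part worth keeping either way.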
Alternatives Considered

| Approach | Trade-off | Verdict |
| --- | --- | --- |
| OTLP → LGTM via Observability Service (proposed) | NVIDIA-standard pattern per the canonical Observability Onboarding Guide. Mimir is pre-integrated with grafana.nvidia.com. Standard NVAuth path. | Preferred. |
| grafana-github-datasource plugin — query GitHub API live | Plugin not installed on NVIDIA Grafana; requires HWinf Grafana Plugin Installation Request. Also: GitHub rate limits, awkward percentile math in Grafana transforms. | Rejected. |
| Ship to NVDataFlow Elasticsearch instead of Mimir | NVDataFlow is a general data warehouse, not the canonical observability backend. Mimir is purpose-built for metrics + Prometheus-compatible querying. | Rejected for this use case. |
| Use ITMP Grafana instance (grafana.service.prd.itmp.nvidia.com) | Separate instance, different team, different access model. No reason to leave the HWinf instance. | Rejected. |
| Custom pipeline (cf. pimlock's cc to @TaylorMutch) | TBD — pending input from @TaylorMutch | TBD |
| GHE Actions Insights (native org feature) | Confirmed not enabled for OpenShell — 404 on https://github.com/NVIDIA/OpenShell/actions/metrics as of 2026-04-23. | Not available. |
| Status quo — manual re-runs of the script | Zero infra. Risks silent regressions. Fine for short-term; inadequate post-migration. | Insufficient. |
Agent Investigation
- scripts/baseline_workflow_metrics.py already produces the metric shape we want (wall p50/p95, queue p50/p95, success rate, run count). Output needs OTLP conversion, not re-computation.
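Since the script already emits these metrics, the proposed alert condition (success_rate < 0.80, wall_p50 > 2 × baseline) can be sanity-checked offline against its output before any Grafana rules exist. A minimal sketch, assuming per-workflow records with workflow, success_rate, and wall_p50 fields (the field names and the baseline value are assumptions, not verified against the script):

```python
# Hypothetical offline check mirroring the proposed Grafana alert rules.
# Field names and the baseline wall-time value are assumptions for
# illustration; real alerting runs in Grafana against Mimir.

def flag_regressions(rows, baselines, min_success=0.80, wall_factor=2.0):
    """Return (workflow, reasons) pairs for workflows breaching a threshold."""
    flagged = []
    for row in rows:
        reasons = []
        if row["success_rate"] < min_success:
            reasons.append(
                f"success_rate {row['success_rate']:.0%} < {min_success:.0%}")
        baseline = baselines.get(row["workflow"])
        if baseline is not None and row["wall_p50"] > wall_factor * baseline:
            reasons.append(
                f"wall_p50 {row['wall_p50']}s > {wall_factor}x baseline {baseline}s")
        if reasons:
            flagged.append((row["workflow"], reasons))
    return flagged
```

Applied to the OS-125 numbers, a workflow like release-vm-kernel.yml at 14% success would trip the success-rate branch immediately, which is exactly the regression class the dashboard is meant to surface.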
- Canonical NVIDIA observability stack confirmed from http://nv/observability (Observability Onboarding Guide):
  - Ingest: https://otlp-http.nvidia.com/v1/{metrics,logs,traces}
  - Backends: Mimir / Loki / Tempo (LGTM stack, managed by hwinf-ltm)
  - Visualization: grafana.nvidia.com (multi-tenant, per-org DLs)
  - Support: #hwinf-ltm-support Slack
- NVDataFlow (Elasticsearch) IS being kept as the long-term analytics store (transition back from OpenSearch is in progress), but it is not the metrics observability path — separate use case.
- grafana-github-datasource plugin is not installed on grafana.nvidia.com; would need a formal request via HWinf Grafana Plugin Installation Request. Unnecessary if we use Mimir.
- GHE Actions Insights (native) confirmed not exposed for OpenShell — both /actions/metrics and org-level equivalent return 404.
- Service Registry status for openshell-ci-metrics: unverified (SSO-only API). Likely does not exist yet.
Open Questions
- Does OpenShell already have a Service Registry entry we can reuse? (Needs SSO check.)
- Which DL gets Admin on the new Grafana org — openshell-maintainers or a new one?
- Is #openshell the right Slack target for alert routing, or should it go somewhere more ops-focused?
- Do @pimlock / @TaylorMutch have context on the prior custom pipeline that should inform design (label schema, metric naming conventions)?
Checklist