
chore(ci): add ARC baseline collector for OS-49 runner migration #927

Merged
jtoelke2 merged 1 commit into main from
jtoelke/os-125-arc-baseline-collector
Apr 23, 2026

@jtoelke2
Collaborator

Summary

Add a stdlib-only Python script that pulls 30-day GitHub Actions baseline metrics (runs, success rate, wall p50/p95, queue p50/p95) for the ten workflows in scope for the ARC → nv-gha-runners migration. This gives us a before-snapshot to compare against as we cut workflows over in Phases 2-7.
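
As a rough illustration of how the six metrics listed above could be derived, here is a minimal stdlib-only sketch. It is not the script from this PR. The record shape (`conclusion`, `wall_s`, `queue_s`), the nearest-rank percentile method, and the choice to compute success rate over all runs are assumptions made for the example; the exclusion of skipped/cancelled/startup_failure runs from the timing percentiles follows the Changes section.

```python
import math

# Conclusions left out of the wall/queue percentiles (per the Changes section).
EXCLUDED_FROM_TIMING = {"skipped", "cancelled", "startup_failure"}


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct; None if values is empty."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs):
    """Reduce hypothetical run records to the baseline metrics.

    Each record is assumed to look like
    {"conclusion": "success", "wall_s": 120.0, "queue_s": 4.0}.
    """
    timed = [r for r in runs if r["conclusion"] not in EXCLUDED_FROM_TIMING]
    successes = sum(1 for r in runs if r["conclusion"] == "success")
    wall = [r["wall_s"] for r in timed]
    queue = [r["queue_s"] for r in timed]
    return {
        "runs": len(runs),
        # Assumption: success rate is taken over all runs, not only timed ones.
        "success_rate": successes / len(runs) if runs else None,
        "wall_p50": percentile(wall, 50),
        "wall_p95": percentile(wall, 95),
        "queue_p50": percentile(queue, 50),
        "queue_p95": percentile(queue, 95),
    }
```

Nearest-rank keeps every reported value an actually observed duration, which makes spot checks against the Actions UI easier. An interpolating method such as `statistics.quantiles` would be an equally reasonable choice.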

Related Issue

Part of the OS-49 runner migration. See Linear OS-49 (parent) and OS-125 (Phase 1 baseline). The baseline numbers produced by this script are captured in the OS-125 Linear document.

Changes

  • scripts/baseline_workflow_metrics.py: queries /repos/{owner}/{repo}/actions/workflows/{id}/runs for top-level workflows and /repos/{owner}/{repo}/actions/runs filtered by referenced_workflows[].path for reusable workflows (docker-build.yml, e2e-test.yml). Excludes skipped/cancelled/startup_failure runs from wall/queue percentiles. Outputs JSON and Markdown.
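
The reusable-workflow filter and the timing inputs described above might look roughly like the sketch below. It is an illustration, not the code in `scripts/baseline_workflow_metrics.py`. The helper names are hypothetical. The definitions of queue time (`run_started_at` minus `created_at`) and wall time (`updated_at` minus `run_started_at`) are assumptions, and the script may measure them differently. The sketch also assumes that `referenced_workflows[].path` may carry an `@ref` suffix and a repository prefix.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a UTC timestamp such as 2026-04-01T10:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def uses_reusable_workflow(run, workflow_file):
    """True if the run references workflow_file (e.g. "docker-build.yml")."""
    for ref in run.get("referenced_workflows") or []:
        # Drop an optional "@ref" suffix before comparing file names.
        path = ref.get("path", "").split("@", 1)[0]
        if path == workflow_file or path.endswith("/" + workflow_file):
            return True
    return False


def durations(run):
    """Queue and wall time in seconds, under the assumed definitions."""
    created = parse_ts(run["created_at"])
    started = parse_ts(run["run_started_at"])
    updated = parse_ts(run["updated_at"])
    return {
        "queue_s": (started - created).total_seconds(),
        "wall_s": (updated - started).total_seconds(),
    }
```

Matching on the file-name suffix rather than the full path keeps the filter independent of the owner/repo prefix and of the ref the caller pinned.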

Testing

  • mise run pre-commit passes
  • Unit tests added/updated — none added; script is one-shot diagnostic, output verified manually against Actions UI
  • E2E tests added/updated (if applicable) — N/A

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) — N/A; plan lives in Linear OS-49

Signed-off-by: Jonas Toelke <jtoelke@nvidia.com>
@jtoelke2 jtoelke2 requested a review from a team as a code owner April 23, 2026 00:10
@jtoelke2 jtoelke2 self-assigned this Apr 23, 2026
@github-actions

github-actions Bot commented Apr 23, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@jtoelke2
Collaborator Author

I have read the DCO document and I hereby sign the DCO.

@jtoelke2
Collaborator Author

recheck

@pimlock
Collaborator

pimlock commented Apr 23, 2026

What are you planning to use to keep track of data? I.e. where is it going to be stored?

I wonder if hooking up some kind of observability platform would make sense here, so we can track things not just as a one-off but long term as well?

There is a hosted Grafana that can use GitHub as a datasource, maybe that would work here? Or we do something custom?

cc @TaylorMutch: I know our previous project built custom metrics/storage/dashboards around CI metrics. I'm not sure why it was custom, but you may have some answers here.

@jtoelke2
Collaborator Author

Intent here is one-shot diagnostic — the script is the evidence trail for OS-49's Phase 1 exit criterion (baseline captured), not a durable metrics pipeline. Numbers land in Linear OS-125 as a point-in-time snapshot, and we re-run manually when we need to diff against a cut-over (Phases 5–7).

Long-term CI observability is out of scope for this migration but genuinely worthwhile. Happy to file a follow-up issue if that sounds right — a hosted-Grafana-with-GitHub-datasource approach would avoid the custom-metrics-platform pattern @TaylorMutch is probably thinking of. For now, PR 927 is just a stdlib Python one-shot: zero deps, zero infra, drop-in disposable.

@pimlock
Collaborator

pimlock commented Apr 23, 2026

Sounds good, @jtoelke2. I was thinking that if getting something longer term were as easy as enabling an integration in Grafana, it would be a win-win, but it doesn't look that way, so a one-off solution for the migration sounds good for now.

@jtoelke2 jtoelke2 merged commit 75b880b into main Apr 23, 2026
13 of 14 checks passed
@jtoelke2 jtoelke2 deleted the jtoelke/os-125-arc-baseline-collector branch April 23, 2026 21:52
@jtoelke2
Collaborator Author

@pimlock @TaylorMutch, I filed a follow-up issue: #954, Long-term CI observability via OTLP → Observability Service (Mimir) + Grafana.

Based on a dig through NVIDIA's Observability Onboarding Guide (http://nv/observability), the canonical path turns out to be OTLP → LGTM (Mimir for metrics) → grafana.nvidia.com — not Grafana's GitHub datasource plugin (not installed on our instance) and not NVDataFlow (wrong backend for observability). GHE Actions Insights is 404 for OpenShell, ruling out the zero-effort native path.

One open row in the alternatives table is explicitly marked TBD pending your input, @TaylorMutch — curious whether your prior custom pipeline has label/schema conventions we should reuse, or if LGTM didn't exist yet when you built it.
