chore(ci): add ARC baseline collector for OS-49 runner migration#927
Conversation
Signed-off-by: Jonas Toelke <jtoelke@nvidia.com>
All contributors have signed the DCO ✍️ ✅

I have read the DCO document and I hereby sign the DCO.

recheck

What are you planning to use to keep track of the data? I.e., where is it going to be stored? I wonder if hooking up some kind of observability platform would make sense here, so we can track things long term rather than just as a one-off. There is a hosted Grafana that can use GitHub as a datasource; maybe that would work here? Or do we do something custom? cc @TaylorMutch: I know our previous project built custom metrics/storage/dashboards around CI metrics. Not sure why it was custom, but you may have some answers here.

Intent here is one-shot diagnostic — the script is the evidence trail for OS-49's Phase 1 exit criterion (baseline captured), not a durable metrics pipeline. Numbers land in Linear OS-125 as a point-in-time snapshot, and we re-run manually when we need to diff against a cut-over (Phases 5–7). Long-term CI observability is out of scope for this migration but genuinely worthwhile. Happy to file a follow-up issue if that sounds right — a hosted-Grafana-with-GitHub-datasource approach would avoid the custom-metrics-platform pattern @TaylorMutch is probably thinking of. For now, PR 927 is just a stdlib Python one-shot: zero deps, zero infra, drop-in disposable.

Sounds good @jtoelke2. I was thinking that if getting something longer-term were as easy as enabling an integration in Grafana, it would be a win-win, but it doesn't look that way, so a one-off solution for the migration sounds good for now.

@pimlock @TaylorMutch — follow-up issue filed: #954, "Long-term CI observability via OTLP → Observability Service (Mimir) + Grafana", based on a dig through NVIDIA's Observability Onboarding Guide. One open row in the alternatives table is explicitly marked TBD pending your input, @TaylorMutch: curious whether your prior custom pipeline has label/schema conventions we should reuse, or if the LGTM stack didn't exist yet when you built it.
Summary
Add a stdlib-only Python script that pulls 30-day GitHub Actions baseline metrics (runs, success rate, wall p50/p95, queue p50/p95) for the ten workflows in scope for the ARC → nv-gha-runners migration. This gives us a before-snapshot to compare against as we cut workflows over in Phases 2-7.
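For reviewers who want a feel for the aggregation without opening the script, here is a minimal stdlib-only sketch of the summary step. The function names (`summarize`, `percentile`) and the wall-time approximation (`updated_at` minus `run_started_at`) are illustrative assumptions, not necessarily what `scripts/baseline_workflow_metrics.py` does; the run fields (`status`, `conclusion`, `created_at`, `run_started_at`, `updated_at`) come from the GitHub Actions workflow-runs API.

```python
from datetime import datetime

# Conclusions excluded from duration percentiles, per the PR description.
EXCLUDED = {"skipped", "cancelled", "startup_failure"}

def _ts(value):
    # GitHub API timestamps are ISO-8601 with a trailing 'Z'.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def percentile(values, pct):
    """Nearest-rank percentile over a small sample; stdlib-only."""
    if not values:
        return None
    ordered = sorted(values)
    return ordered[round(pct / 100 * (len(ordered) - 1))]

def summarize(runs):
    """Success rate plus wall/queue p50/p95 for completed workflow runs.

    Excluded conclusions stay in the run count but are dropped from the
    duration percentiles. Wall time is approximated here as
    updated_at - run_started_at (an assumption, not the script's code).
    """
    completed = [r for r in runs if r.get("status") == "completed"]
    timed = [r for r in completed if r.get("conclusion") not in EXCLUDED]
    wall = [(_ts(r["updated_at"]) - _ts(r["run_started_at"])).total_seconds()
            for r in timed]
    queue = [(_ts(r["run_started_at"]) - _ts(r["created_at"])).total_seconds()
             for r in timed]
    successes = sum(1 for r in completed if r.get("conclusion") == "success")
    return {
        "runs": len(completed),
        "success_rate": successes / len(completed) if completed else None,
        "wall_p50": percentile(wall, 50),
        "wall_p95": percentile(wall, 95),
        "queue_p50": percentile(queue, 50),
        "queue_p95": percentile(queue, 95),
    }
```

Pagination and the per-workflow API calls are omitted; the idea is to feed `summarize` the JSON run objects returned by the endpoints the script queries.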
Related Issue
Part of the OS-49 runner migration. See Linear OS-49 (parent) and OS-125 (Phase 1 baseline). The baseline numbers produced by this script are captured in the OS-125 Linear document.
Changes
scripts/baseline_workflow_metrics.py: queries `/repos/{owner}/{repo}/actions/workflows/{id}/runs` for top-level workflows and `/repos/{owner}/{repo}/actions/runs` filtered by `referenced_workflows[].path` for reusable workflows (`docker-build.yml`, `e2e-test.yml`). Excludes `skipped`/`cancelled`/`startup_failure` runs from wall/queue percentiles. Outputs JSON and Markdown.
Testing
`mise run pre-commit` passes
Checklist