CI: add runner label queue time analytics #2606
Conversation
Relocate CI monitor helper scripts to .github/scripts and update workflow trigger paths and execution paths accordingly.
Fetch Actions data once per run and consume it in matrix and fleet reports to enable PR coverage while reducing API pressure and schedule input edge cases.
🏷️ CI Guide

Runs automatically on every PR:
Extended tests (opt-in via labels):
Pull request overview
This PR enhances the AMD CI job monitor workflow by introducing runner-label-based queue time analytics, and by switching reporting to operate from a cached “actions snapshot” artifact so queue/duration metrics can be computed without re-querying the GitHub API per job.
Changes:
- Added a snapshot capture/consume flow in `amd-ci-job-monitor.yml` so per-job reports and runner fleet reports can be generated from cached job metadata.
- Replaced the previous `scripts/ci/query_job_status.py` with an updated `.github/scripts/query_job_status.py` that records timestamps/labels and computes queue-time and duration analytics.
- Updated `.github/scripts/list_jobs.py` to emit both job IDs and stable display names for matrix generation.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `scripts/ci/query_job_status.py` | Removed the legacy job status reporter script (replaced by the `.github/scripts/` version). |
| `.github/workflows/amd-ci-job-monitor.yml` | Adds permissions, snapshot fetch/upload, and switches report generation to consume the snapshot artifact; renames the runner fleet step to emphasize queue reporting. |
| `.github/scripts/query_job_status.py` | New reporter supporting snapshot in/out, runner-label grouping, and queue/duration analytics tables. |
| `.github/scripts/list_jobs.py` | Improves workflow job discovery by extracting stable job display names and including job IDs in the matrix. |
```python
all_rows: list[dict[str, Any]] = []
if args.snapshot_in:
    snapshot_payload = json.loads(Path(args.snapshot_in).read_text(encoding="utf-8"))
```
When loading a snapshot via --snapshot-in, report_time is still set to datetime.now(...). This makes queue-time calculations for queued/waiting jobs depend on when the report is generated, not when the snapshot was captured, which undermines the goal of computing metrics from cached snapshot data. Consider reading generated_at from the snapshot payload and using that timestamp as report_time (fallback to now if missing/invalid).
Suggested change:

```python
snapshot_payload = json.loads(Path(args.snapshot_in).read_text(encoding="utf-8"))
snapshot_generated_at = snapshot_payload.get("generated_at")
if isinstance(snapshot_generated_at, str):
    try:
        parsed_report_time = datetime.fromisoformat(
            snapshot_generated_at.replace("Z", "+00:00")
        )
        if parsed_report_time.tzinfo is None:
            report_time = parsed_report_time.replace(tzinfo=timezone.utc)
        else:
            report_time = parsed_report_time.astimezone(timezone.utc)
    except ValueError:
        pass
```
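The fallback parsing in the suggested change can be exercised in isolation. `normalize_report_time` is a hypothetical helper name used here for illustration; the `Z`-suffix replacement matters because `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward.

```python
from __future__ import annotations

from datetime import datetime, timezone


def normalize_report_time(value: str | None, fallback: datetime) -> datetime:
    """Parse an ISO-8601 snapshot timestamp, falling back when missing/invalid."""
    if not isinstance(value, str):
        return fallback
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return fallback
    # Treat naive timestamps as UTC; convert aware ones to UTC.
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


now = datetime.now(timezone.utc)
print(normalize_report_time("2024-05-01T12:00:00Z", now))    # 2024-05-01 12:00:00+00:00
print(normalize_report_time("not-a-timestamp", now) is now)  # True
```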
```python
workflow_set = set(workflows)
job_rows = [
    row
    for row in all_rows
    if row.get("workflow") in workflow_set
    and job_name_matches(args.job, row.get("job", ""))
```
--hours is effectively ignored when --snapshot-in is provided: lookback is computed but no time-based filtering is applied to all_rows / job_rows. This can produce reports that don’t match the requested time window and makes snapshot-driven runs dependent on how the snapshot was originally created. Consider filtering all_rows (or job_rows) by created_at (>= lookback) regardless of whether data came from live API calls or a snapshot.
Suggested change:

```python
def parse_row_created_at(row: dict[str, Any]) -> datetime | None:
    created_at = row.get("created_at")
    if not created_at:
        return None
    if isinstance(created_at, datetime):
        return (
            created_at
            if created_at.tzinfo is not None
            else created_at.replace(tzinfo=timezone.utc)
        )
    if not isinstance(created_at, str):
        return None
    try:
        return datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    except ValueError:
        return None


workflow_set = set(workflows)
job_rows = [
    row
    for row in all_rows
    if row.get("workflow") in workflow_set
    and job_name_matches(args.job, row.get("job", ""))
    and (created_at := parse_row_created_at(row)) is not None
    and created_at >= lookback
```
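A quick way to sanity-check the lookback filter is to run it over a few synthetic rows. The rows below are made up for illustration; only the `created_at` handling mirrors the suggested change.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Any


def parse_row_created_at(row: dict[str, Any]) -> datetime | None:
    """Return an aware UTC datetime for a row, or None if unparseable."""
    created_at = row.get("created_at")
    if isinstance(created_at, datetime):
        return created_at if created_at.tzinfo else created_at.replace(tzinfo=timezone.utc)
    if not isinstance(created_at, str) or not created_at:
        return None
    try:
        return datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    except ValueError:
        return None


now = datetime(2024, 5, 2, tzinfo=timezone.utc)
lookback = now - timedelta(hours=24)
rows = [
    {"created_at": "2024-05-01T23:00:00Z"},  # inside the 24h window: kept
    {"created_at": "2024-04-30T10:00:00Z"},  # older than the lookback: dropped
    {"created_at": None},                    # missing timestamp: dropped
]
recent = [
    row
    for row in rows
    if (created := parse_row_created_at(row)) is not None and created >= lookback
]
print(len(recent))  # 1
```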
```python
index = min(int(len(ordered) * percent / 100), len(ordered) - 1)
return ordered[index]
```
The percentile() implementation uses int(len(values) * percent / 100) to pick an index. For common cases (e.g., 90th percentile with 10 samples), this selects the maximum (index 9) rather than the 90th-percentile position (around index 8), which will systematically overstate p50/p90/p99 in the queue-time report. Consider using a more standard percentile index calculation (e.g., based on (n-1) * p/100 with appropriate interpolation/rounding) or statistics.quantiles/NumPy when available.
Suggested change:

```python
clamped_percent = max(0, min(percent, 100))
position = (len(ordered) - 1) * clamped_percent / 100
lower_index = int(position)
upper_index = min(lower_index + 1, len(ordered) - 1)
if lower_index == upper_index:
    return ordered[lower_index]
fraction = position - lower_index
return ordered[lower_index] + (ordered[upper_index] - ordered[lower_index]) * fraction
```
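The difference is easy to see with ten evenly spaced samples: the truncating index pick returns the maximum for p90, while linear interpolation over `(n - 1) * p / 100` lands between the ninth and tenth values. Both functions below are standalone sketches of the two approaches, not code from the PR.

```python
def percentile_truncating(ordered: list[float], percent: float) -> float:
    # Original approach: index by n * p / 100, truncated.
    index = min(int(len(ordered) * percent / 100), len(ordered) - 1)
    return ordered[index]


def percentile_interpolated(ordered: list[float], percent: float) -> float:
    # Linear interpolation over (n - 1) * p / 100, as in the suggestion.
    position = (len(ordered) - 1) * max(0, min(percent, 100)) / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


samples = [float(n) for n in range(1, 11)]  # ten queue times: 1.0 .. 10.0
print(percentile_truncating(samples, 90))    # 10.0 (the maximum)
print(percentile_interpolated(samples, 90))  # 9.1
```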
Resolve the AMD CI monitor merge conflicts while preserving the snapshot-based queue analytics changes. Address snapshot report time, snapshot lookback filtering, and percentile calculation so the branch stays merge-ready.
Format the queue monitor script so the CI style check passes without changing behavior.
Replace the queue-focused fleet report with concurrency metrics so runner saturation is easier to inspect by label in the CI monitor output.
Limit the concurrency summary to build-only-aiter and mi35x labels so the report focuses on the runner pools we actually care about and ignores hosted runner names.
* CI: move monitor scripts under .github/scripts
  Relocate CI monitor helper scripts to .github/scripts and update workflow trigger paths and execution paths accordingly.
* CI: add snapshot-based PR monitor execution
  Fetch Actions data once per run and consume it in matrix and fleet reports to enable PR coverage while reducing API pressure and schedule input edge cases.
* CI: add runner label queue time analytics
* CI: fix Black formatting in queue monitor script
  Format the queue monitor script so the CI style check passes without changing behavior.
* CI: add runner label concurrency summary
  Replace the queue-focused fleet report with concurrency metrics so runner saturation is easier to inspect by label in the CI monitor output.
* CI: narrow runner label concurrency reporting
  Limit the concurrency summary to build-only-aiter and mi35x labels so the report focuses on the runner pools we actually care about and ignores hosted runner names.
Summary
Test plan
- `python3 -m py_compile .github/scripts/query_job_status.py .github/scripts/list_jobs.py`
- Run `.github/scripts/query_job_status.py --runner-report --summary` against a synthetic snapshot and verify the runner label queue summary and queue distribution tables are generated
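A synthetic snapshot for the test plan can be as small as one row. The top-level `generated_at` key appears in the reviewed diff; the row fields used here (`workflow`, `job`, `created_at`, `labels`) are assumptions for illustration, not the script's authoritative schema.

```python
import json
from pathlib import Path

# Minimal made-up snapshot: one queued job on an mi35x runner.
snapshot = {
    "generated_at": "2024-05-01T12:00:00Z",
    "rows": [
        {
            "workflow": "amd-ci",
            "job": "build",
            "created_at": "2024-05-01T11:50:00Z",
            "labels": ["mi35x"],
        }
    ],
}
Path("snapshot.json").write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
```

Feeding this file to the reporter via `--snapshot-in snapshot.json --runner-report --summary` should exercise the snapshot path without any GitHub API calls.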