
CI: add runner label queue time analytics #2606

Merged
gyohuangxin merged 8 commits into main from ci/aiter-monitor-queue-labels
Apr 7, 2026

Conversation

@gyohuangxin (Member)

Summary

  • add runner-label-based queue time analytics to the AMD CI monitor report
  • capture job labels and timestamps in the snapshot so queue and duration metrics can be computed from cached data
  • rename the runner fleet summary step to highlight queue-focused reporting

Test plan

  • python3 -m py_compile .github/scripts/query_job_status.py .github/scripts/list_jobs.py
  • run .github/scripts/query_job_status.py --runner-report --summary against a synthetic snapshot and verify that the runner label queue summary and queue distribution tables are generated

Relocate CI monitor helper scripts to .github/scripts and update workflow trigger paths and execution paths accordingly.
Fetch Actions data once per run and consume it in matrix and fleet reports to enable PR coverage while reducing API pressure and schedule input edge cases.
@gyohuangxin gyohuangxin requested review from a team and Copilot April 3, 2026 07:46

github-actions bot commented Apr 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-355 | Run Triton tests on MI355 in addition to MI325 |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2606 --add-label <label>`.

Copilot AI left a comment

Pull request overview

This PR enhances the AMD CI job monitor workflow by introducing runner-label-based queue time analytics, and by switching reporting to operate from a cached “actions snapshot” artifact so queue/duration metrics can be computed without re-querying the GitHub API per job.

Changes:

  • Added snapshot capture/consume flow in amd-ci-job-monitor.yml so per-job reports and runner fleet reports can be generated from cached job metadata.
  • Replaced the previous scripts/ci/query_job_status.py with an updated .github/scripts/query_job_status.py that records timestamps/labels and computes queue-time and duration analytics.
  • Updated .github/scripts/list_jobs.py to emit both job IDs and stable display names for matrix generation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/ci/query_job_status.py | Removed the legacy job status reporter script (replaced by the .github/scripts/ version). |
| .github/workflows/amd-ci-job-monitor.yml | Adds permissions and snapshot fetch/upload, switches report generation to consume the snapshot artifact, and renames the runner fleet step to emphasize queue reporting. |
| .github/scripts/query_job_status.py | New reporter supporting snapshot in/out, runner-label grouping, and queue/duration analytics tables. |
| .github/scripts/list_jobs.py | Improves workflow job discovery by extracting stable job display names and including job IDs in the matrix. |


Comment thread .github/scripts/query_job_status.py (outdated)

```python
all_rows: list[dict[str, Any]] = []
if args.snapshot_in:
    snapshot_payload = json.loads(Path(args.snapshot_in).read_text(encoding="utf-8"))
```
Copilot AI commented Apr 3, 2026
When loading a snapshot via --snapshot-in, report_time is still set to datetime.now(...). This makes queue-time calculations for queued/waiting jobs depend on when the report is generated, not when the snapshot was captured, which undermines the goal of computing metrics from cached snapshot data. Consider reading generated_at from the snapshot payload and using that timestamp as report_time (fallback to now if missing/invalid).

Suggested change:

```python
snapshot_payload = json.loads(Path(args.snapshot_in).read_text(encoding="utf-8"))
snapshot_generated_at = snapshot_payload.get("generated_at")
if isinstance(snapshot_generated_at, str):
    try:
        parsed_report_time = datetime.fromisoformat(
            snapshot_generated_at.replace("Z", "+00:00")
        )
        if parsed_report_time.tzinfo is None:
            report_time = parsed_report_time.replace(tzinfo=timezone.utc)
        else:
            report_time = parsed_report_time.astimezone(timezone.utc)
    except ValueError:
        pass
```
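As a standalone illustration of the Z-suffix handling proposed above (the helper name and sample timestamp are hypothetical, not part of the PR):

```python
from datetime import datetime, timezone

# Hypothetical helper mirroring the suggestion: normalize an ISO-8601
# snapshot timestamp (Z-suffixed, offset-bearing, or naive) to aware UTC.
# The Z -> +00:00 replacement keeps fromisoformat() happy on Python < 3.11.
def parse_snapshot_time(value: str) -> datetime:
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

print(parse_snapshot_time("2026-04-03T07:46:00Z"))
# → 2026-04-03 07:46:00+00:00
```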

Comment on lines +501 to +507:

```python
workflow_set = set(workflows)
job_rows = [
    row
    for row in all_rows
    if row.get("workflow") in workflow_set
    and job_name_matches(args.job, row.get("job", ""))
```
Copilot AI commented Apr 3, 2026
--hours is effectively ignored when --snapshot-in is provided: lookback is computed but no time-based filtering is applied to all_rows / job_rows. This can produce reports that don’t match the requested time window and makes snapshot-driven runs dependent on how the snapshot was originally created. Consider filtering all_rows (or job_rows) by created_at (>= lookback) regardless of whether data came from live API calls or a snapshot.

Suggested change:

```python
def parse_row_created_at(row: dict[str, Any]) -> datetime | None:
    created_at = row.get("created_at")
    if not created_at:
        return None
    if isinstance(created_at, datetime):
        return (
            created_at
            if created_at.tzinfo is not None
            else created_at.replace(tzinfo=timezone.utc)
        )
    if not isinstance(created_at, str):
        return None
    try:
        return datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    except ValueError:
        return None

workflow_set = set(workflows)
job_rows = [
    row
    for row in all_rows
    if row.get("workflow") in workflow_set
    and job_name_matches(args.job, row.get("job", ""))
    and (created_at := parse_row_created_at(row)) is not None
    and created_at >= lookback
```
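A minimal sketch of the time-window filtering this comment asks for, using hypothetical snapshot rows and a fixed 24-hour window (row shapes and field values are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical snapshot rows; only "recent" falls inside the 24h window.
rows = [
    {"job": "old", "created_at": "2026-04-01T00:00:00Z"},
    {"job": "recent", "created_at": "2026-04-03T09:00:00Z"},
]

report_time = datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc)
lookback = report_time - timedelta(hours=24)

def row_created_at(row: dict) -> datetime:
    # Normalize the Z suffix so fromisoformat() accepts it on older Pythons.
    return datetime.fromisoformat(row["created_at"].replace("Z", "+00:00"))

kept = [row["job"] for row in rows if row_created_at(row) >= lookback]
print(kept)  # → ['recent']
```

The same predicate applies whether the rows came from live API calls or a cached snapshot, which is the point of the review comment.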

Comment thread .github/scripts/query_job_status.py (outdated)
Comment on lines +288 to +289:

```python
index = min(int(len(ordered) * percent / 100), len(ordered) - 1)
return ordered[index]
```

Copilot AI commented Apr 3, 2026

The percentile() implementation uses int(len(values) * percent / 100) to pick an index. For common cases (e.g., 90th percentile with 10 samples), this selects the maximum (index 9) rather than the 90th-percentile position (around index 8), which will systematically overstate p50/p90/p99 in the queue-time report. Consider using a more standard percentile index calculation (e.g., based on (n-1) * p/100 with appropriate interpolation/rounding) or statistics.quantiles/NumPy when available.

Suggested change:

```python
clamped_percent = max(0, min(percent, 100))
position = (len(ordered) - 1) * clamped_percent / 100
lower_index = int(position)
upper_index = min(lower_index + 1, len(ordered) - 1)
if lower_index == upper_index:
    return ordered[lower_index]
fraction = position - lower_index
return ordered[lower_index] + (ordered[upper_index] - ordered[lower_index]) * fraction
```
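To make the bias concrete, here is a sketch comparing the two index formulas on ten evenly spaced samples (function names are illustrative, not from the PR):

```python
# Compare the original index formula with the interpolating one for p90
# over 10 sorted samples (hypothetical queue times 0..9).
samples = list(range(10))

def old_percentile(values, percent):
    ordered = sorted(values)
    # int(n * p / 100) picks index 9 for p90 with n=10: the maximum.
    return ordered[min(int(len(ordered) * percent / 100), len(ordered) - 1)]

def new_percentile(values, percent):
    ordered = sorted(values)
    position = (len(ordered) - 1) * percent / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

print(old_percentile(samples, 90))  # → 9 (the maximum, overstated)
print(new_percentile(samples, 90))  # → 8.1 (interpolated at position 8.1)
```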

Resolve the AMD CI monitor merge conflicts while preserving the snapshot-based queue analytics changes. Address snapshot report time, snapshot lookback filtering, and percentile calculation so the branch stays merge-ready.
Format the queue monitor script so the CI style check passes without changing behavior.
Replace the queue-focused fleet report with concurrency metrics so runner saturation is easier to inspect by label in the CI monitor output.
Limit the concurrency summary to build-only-aiter and mi35x labels so the report focuses on the runner pools we actually care about and ignores hosted runner names.
@gyohuangxin gyohuangxin merged commit 88e7447 into main Apr 7, 2026
50 of 53 checks passed
@gyohuangxin gyohuangxin deleted the ci/aiter-monitor-queue-labels branch April 7, 2026 06:38
yzhou103 pushed a commit that referenced this pull request Apr 8, 2026
* CI: move monitor scripts under .github/scripts

Relocate CI monitor helper scripts to .github/scripts and update workflow trigger paths and execution paths accordingly.

* CI: add snapshot-based PR monitor execution

Fetch Actions data once per run and consume it in matrix and fleet reports to enable PR coverage while reducing API pressure and schedule input edge cases.

* CI: add runner label queue time analytics

* CI: fix Black formatting in queue monitor script

Format the queue monitor script so the CI style check passes without changing behavior.

* CI: add runner label concurrency summary

Replace the queue-focused fleet report with concurrency metrics so runner saturation is easier to inspect by label in the CI monitor output.

* CI: narrow runner label concurrency reporting

Limit the concurrency summary to build-only-aiter and mi35x labels so the report focuses on the runner pools we actually care about and ignores hosted runner names.