
CI: add runner label queue time analytics #2606

Merged
gyohuangxin merged 8 commits into main from ci/aiter-monitor-queue-labels
Apr 7, 2026

Conversation

@gyohuangxin (Member)

Summary

  • add runner-label-based queue time analytics to the AMD CI monitor report
  • capture job labels and timestamps in the snapshot so queue and duration metrics can be computed from cached data
  • rename the runner fleet summary step to highlight queue-focused reporting

Test plan

  • python3 -m py_compile .github/scripts/query_job_status.py .github/scripts/list_jobs.py
  • run .github/scripts/query_job_status.py --runner-report --summary against a synthetic snapshot and verify that the runner label queue summary and queue distribution tables are generated

Relocate CI monitor helper scripts to .github/scripts and update workflow trigger paths and execution paths accordingly.
Fetch Actions data once per run and consume it in matrix and fleet reports to enable PR coverage while reducing API pressure and schedule input edge cases.
@gyohuangxin gyohuangxin requested review from a team and Copilot April 3, 2026 07:46

github-actions bot commented Apr 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-355 | Run Triton tests on MI355 in addition to MI325 |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2606 --add-label <label>`.

Copilot AI left a comment

Pull request overview

This PR enhances the AMD CI job monitor workflow by introducing runner-label-based queue time analytics, and by switching reporting to operate from a cached “actions snapshot” artifact so queue/duration metrics can be computed without re-querying the GitHub API per job.

Changes:

  • Added snapshot capture/consume flow in amd-ci-job-monitor.yml so per-job reports and runner fleet reports can be generated from cached job metadata.
  • Replaced the previous scripts/ci/query_job_status.py with an updated .github/scripts/query_job_status.py that records timestamps/labels and computes queue-time and duration analytics.
  • Updated .github/scripts/list_jobs.py to emit both job IDs and stable display names for matrix generation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/ci/query_job_status.py | Removed the legacy job status reporter script (replaced by the .github/scripts/ version). |
| .github/workflows/amd-ci-job-monitor.yml | Adds permissions and snapshot fetch/upload, switches report generation to consume the snapshot artifact, and renames the runner fleet step to emphasize queue reporting. |
| .github/scripts/query_job_status.py | New reporter supporting snapshot in/out, runner-label grouping, and queue/duration analytics tables. |
| .github/scripts/list_jobs.py | Improves workflow job discovery by extracting stable job display names and including job IDs in the matrix. |


Comment thread .github/scripts/query_job_status.py (outdated)

```python
all_rows: list[dict[str, Any]] = []
if args.snapshot_in:
    snapshot_payload = json.loads(Path(args.snapshot_in).read_text(encoding="utf-8"))
```
Copilot AI commented Apr 3, 2026
When loading a snapshot via --snapshot-in, report_time is still set to datetime.now(...). This makes queue-time calculations for queued/waiting jobs depend on when the report is generated, not when the snapshot was captured, which undermines the goal of computing metrics from cached snapshot data. Consider reading generated_at from the snapshot payload and using that timestamp as report_time (fallback to now if missing/invalid).

Suggested change:

```python
snapshot_payload = json.loads(Path(args.snapshot_in).read_text(encoding="utf-8"))
snapshot_generated_at = snapshot_payload.get("generated_at")
if isinstance(snapshot_generated_at, str):
    try:
        parsed_report_time = datetime.fromisoformat(
            snapshot_generated_at.replace("Z", "+00:00")
        )
        if parsed_report_time.tzinfo is None:
            report_time = parsed_report_time.replace(tzinfo=timezone.utc)
        else:
            report_time = parsed_report_time.astimezone(timezone.utc)
    except ValueError:
        pass
```
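As a standalone illustration of the Z-suffix handling proposed above (the helper name and sample timestamp are hypothetical, not part of the PR):

```python
from datetime import datetime, timezone

# Hypothetical helper mirroring the suggestion: normalize an ISO-8601
# snapshot timestamp (Z-suffixed, offset-bearing, or naive) to aware UTC.
# The Z -> +00:00 replacement keeps fromisoformat() happy on Python < 3.11.
def parse_snapshot_time(value: str) -> datetime:
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

print(parse_snapshot_time("2026-04-03T07:46:00Z"))
# → 2026-04-03 07:46:00+00:00
```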

Comment on lines +501 to +507:

```python
workflow_set = set(workflows)
job_rows = [
    row
    for row in all_rows
    if row.get("workflow") in workflow_set
    and job_name_matches(args.job, row.get("job", ""))
```
Copilot AI commented Apr 3, 2026
--hours is effectively ignored when --snapshot-in is provided: lookback is computed but no time-based filtering is applied to all_rows / job_rows. This can produce reports that don’t match the requested time window and makes snapshot-driven runs dependent on how the snapshot was originally created. Consider filtering all_rows (or job_rows) by created_at (>= lookback) regardless of whether data came from live API calls or a snapshot.

Suggested change:

```python
def parse_row_created_at(row: dict[str, Any]) -> datetime | None:
    created_at = row.get("created_at")
    if not created_at:
        return None
    if isinstance(created_at, datetime):
        return (
            created_at
            if created_at.tzinfo is not None
            else created_at.replace(tzinfo=timezone.utc)
        )
    if not isinstance(created_at, str):
        return None
    try:
        return datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    except ValueError:
        return None

workflow_set = set(workflows)
job_rows = [
    row
    for row in all_rows
    if row.get("workflow") in workflow_set
    and job_name_matches(args.job, row.get("job", ""))
    and (created_at := parse_row_created_at(row)) is not None
    and created_at >= lookback
```
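A minimal sketch of the time-window filtering this comment asks for, using hypothetical snapshot rows and a fixed 24-hour window (row shapes and field values are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical snapshot rows; only "recent" falls inside the 24h window.
rows = [
    {"job": "old", "created_at": "2026-04-01T00:00:00Z"},
    {"job": "recent", "created_at": "2026-04-03T09:00:00Z"},
]

report_time = datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc)
lookback = report_time - timedelta(hours=24)

def row_created_at(row: dict) -> datetime:
    # Normalize the Z suffix so fromisoformat() accepts it on older Pythons.
    return datetime.fromisoformat(row["created_at"].replace("Z", "+00:00"))

kept = [row["job"] for row in rows if row_created_at(row) >= lookback]
print(kept)  # → ['recent']
```

The same predicate applies whether the rows came from live API calls or a cached snapshot, which is the point of the review comment.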

Comment thread .github/scripts/query_job_status.py (outdated)
Comment on lines +288 to +289:

```python
index = min(int(len(ordered) * percent / 100), len(ordered) - 1)
return ordered[index]
```

Copilot AI commented Apr 3, 2026

The percentile() implementation uses int(len(values) * percent / 100) to pick an index. For common cases (e.g., 90th percentile with 10 samples), this selects the maximum (index 9) rather than the 90th-percentile position (around index 8), which will systematically overstate p50/p90/p99 in the queue-time report. Consider using a more standard percentile index calculation (e.g., based on (n-1) * p/100 with appropriate interpolation/rounding) or statistics.quantiles/NumPy when available.

Suggested change:

```python
clamped_percent = max(0, min(percent, 100))
position = (len(ordered) - 1) * clamped_percent / 100
lower_index = int(position)
upper_index = min(lower_index + 1, len(ordered) - 1)
if lower_index == upper_index:
    return ordered[lower_index]
fraction = position - lower_index
return ordered[lower_index] + (ordered[upper_index] - ordered[lower_index]) * fraction
```
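To make the bias concrete, here is a sketch comparing the two index formulas on ten evenly spaced samples (function names are illustrative, not from the PR):

```python
# Compare the original index formula with the interpolating one for p90
# over 10 sorted samples (hypothetical queue times 0..9).
samples = list(range(10))

def old_percentile(values, percent):
    ordered = sorted(values)
    # int(n * p / 100) picks index 9 for p90 with n=10: the maximum.
    return ordered[min(int(len(ordered) * percent / 100), len(ordered) - 1)]

def new_percentile(values, percent):
    ordered = sorted(values)
    position = (len(ordered) - 1) * percent / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

print(old_percentile(samples, 90))  # → 9 (the maximum, overstated)
print(new_percentile(samples, 90))  # → 8.1 (interpolated at position 8.1)
```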

Resolve the AMD CI monitor merge conflicts while preserving the snapshot-based queue analytics changes. Address snapshot report time, snapshot lookback filtering, and percentile calculation so the branch stays merge-ready.
Format the queue monitor script so the CI style check passes without changing behavior.
Replace the queue-focused fleet report with concurrency metrics so runner saturation is easier to inspect by label in the CI monitor output.
Limit the concurrency summary to build-only-aiter and mi35x labels so the report focuses on the runner pools we actually care about and ignores hosted runner names.
@gyohuangxin gyohuangxin merged commit 88e7447 into main Apr 7, 2026
50 of 53 checks passed
@gyohuangxin gyohuangxin deleted the ci/aiter-monitor-queue-labels branch April 7, 2026 06:38
yzhou103 pushed a commit that referenced this pull request Apr 8, 2026
* CI: move monitor scripts under .github/scripts

Relocate CI monitor helper scripts to .github/scripts and update workflow trigger paths and execution paths accordingly.

* CI: add snapshot-based PR monitor execution

Fetch Actions data once per run and consume it in matrix and fleet reports to enable PR coverage while reducing API pressure and schedule input edge cases.

* CI: add runner label queue time analytics

* CI: fix Black formatting in queue monitor script

Format the queue monitor script so the CI style check passes without changing behavior.

* CI: add runner label concurrency summary

Replace the queue-focused fleet report with concurrency metrics so runner saturation is easier to inspect by label in the CI monitor output.

* CI: narrow runner label concurrency reporting

Limit the concurrency summary to build-only-aiter and mi35x labels so the report focuses on the runner pools we actually care about and ignores hosted runner names.