Parity Auto Trigger: run parity.yml per upstream commit once CI finishes #3176
Adds .github/workflows/parity-auto.yml, which runs on a 30-minute
cron (and workflow_dispatch for testing) and:
1. Pulls the most recent commits from pytorch/pytorch:main.
2. Skips commits that are too new (CI not started), too old
(back-fill limit), or already have a parity.yml run in this
repo (detected by matching the full SHA in prior run titles).
3. For the first remaining commit whose upstream check-runs are
all "completed", dispatches parity.yml with that SHA so
download_testlogs pulls the artifacts and logs for that exact
build and generate_summary.py produces the per-arch report.
csv_name is set to "autoparity-YYYYMMDD-<full SHA>" so the SHA ends
up in the dispatched run's display title, which is what this
workflow queries to avoid re-dispatching.
Inputs expose max_commits, lookback_hours, max_age_hours, arch, and
dry_run for tuning / debugging without code changes.
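
The title-based de-duplication described above can be sketched in shell roughly as follows. This is a minimal illustration, not the workflow's actual script: the helper names `make_csv_name` and `already_dispatched` are hypothetical, and prior run titles are assumed to arrive one per line on stdin.

```shell
# Build the csv_name described above: autoparity-YYYYMMDD-<full SHA>.
# The optional second argument overrides the UTC date stamp (handy for tests).
make_csv_name() {
  local sha="$1"
  local stamp="${2:-$(date -u +%Y%m%d)}"
  printf 'autoparity-%s-%s\n' "$stamp" "$sha"
}

# Return 0 if any prior run title read from stdin contains the full SHA.
# -F treats the SHA as a fixed string, -q suppresses output.
already_dispatched() {
  local sha="$1"
  grep -qF -- "$sha"
}
```

One possible source for the titles is `gh run list --workflow parity.yml --json displayTitle --jq '.[].displayTitle'`, piped into `already_dispatched "$SHA"`; whether the real workflow queries titles this way is an assumption.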
Jenkins build for b926457d84f7482481aa8f1fdfced65247ad8882 commit finished as NOT_BUILT
…ow completion
Previously the workflow used a blunt "all upstream check-runs
completed" gate and dispatched parity.yml with a fixed arch list
(mi355, mi300, mi200). That meant:
* We blocked on hundreds of unrelated upstream check-runs
(labeler bots, etc.).
* We'd dispatch with arch="mi355, mi300, mi200" for a commit
where only `trunk` had run, so mi300/mi200 had no data and
the parity report came out nearly empty.
Per-arch rewrite:
* Query `repos/pytorch/pytorch/actions/runs?head_sha=<SHA>` to
see which upstream workflows actually completed on the commit.
* Map each arch to its default-tier upstream workflow (mi355→
trunk, mi300→rocm-mi300, mi200→trunk-rocm-sandbox, navi31→
rocm-navi31, nightly→rocm-nightly), exposed as
`arch_workflow_map` input.
* For each SHA newest→oldest, compute ready archs = archs whose
required workflow is completed, minus archs already dispatched
for that SHA (parsed from prior parity run titles after " · ").
* If the remaining set is non-empty, dispatch parity.yml with
arch=<that subset> and csv_name embedding the full SHA.
Effect: mi355 gets a parity report per upstream commit (trunk
runs per-commit). mi300/mi200 get dispatched separately whenever
their less-frequent periodic workflow finishes on a given SHA.
Each (SHA, arch) pair is dispatched at most once.
Also adds a `target_ref` input so the dispatched parity.yml can
run off a specific branch (useful for testing against a branch
that has the up-to-date parity scripts while the workflow file
itself lives on the default branch).
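
The per-arch selection above (ready archs minus already-dispatched archs) could look something like the sketch below. The function name `ready_archs` and the `arch=workflow,...` encoding of `arch_workflow_map` are assumptions made for illustration; the real input format is not shown in this PR.

```shell
# ready_archs MAP COMPLETED DONE
#   MAP:       "arch=workflow,arch=workflow,..." (hypothetical encoding)
#   COMPLETED: newline-separated upstream workflow names completed on the SHA
#   DONE:      comma-separated archs already dispatched for the SHA
# Prints the comma-separated archs that are ready and not yet dispatched.
ready_archs() {
  local map="$1" completed="$2" done_archs="$3"
  local pair arch wf out=""
  local IFS=','
  for pair in $map; do
    arch="${pair%%=*}"
    wf="${pair#*=}"
    # The arch's required workflow must have completed (exact line match).
    grep -qxF -- "$wf" <<<"$completed" || continue
    # Skip archs that already have a parity run for this SHA.
    case ",${done_archs// /}," in *",$arch,"*) continue ;; esac
    out="${out:+$out,}$arch"
  done
  printf '%s\n' "$out"
}
```

For example, with `mi355=trunk,mi300=rocm-mi300` as the map, `trunk` and `rocm-mi300` completed, and `mi355` already dispatched, the function prints `mi300`, which would then be passed as the `arch` input to parity.yml.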
Jenkins build for b926457d84f7482481aa8f1fdfced65247ad8882 commit finished as FAILURE
The loop was silently aborting after printing 'no ready archs' for the first commit, because set -e was catching a non-zero exit in the next iteration (most likely date -u -d failing on an edge-case DATE string, or a gh api pagination call hitting a transient error). Drop -e (we already guard the pipelines that matter with || true), and make COMMIT_EPOCH fall back to 0 + skip the age check if date -d parsing fails.
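
A minimal sketch of the COMMIT_EPOCH fallback described above, assuming GNU `date`. The helper names `commit_epoch` and `age_ok` and the explicit min/max age arguments are hypothetical; they stand in for whatever the workflow derives from its inputs.

```shell
# Convert a commit date string to epoch seconds (GNU date).
# Prints 0 when the string cannot be parsed, instead of failing.
commit_epoch() {
  date -u -d "$1" +%s 2>/dev/null || echo 0
}

# age_ok EPOCH NOW MIN_AGE_S MAX_AGE_S
# Returns 0 (keep the commit) when the epoch is unknown (0) or the commit
# age lies within [MIN_AGE_S, MAX_AGE_S]; returns 1 otherwise.
age_ok() {
  local epoch="$1" now="$2" min_age="$3" max_age="$4" age
  if [ "$epoch" -eq 0 ]; then
    return 0   # unparseable date: skip the age check entirely
  fi
  age=$(( now - epoch ))
  [ "$age" -ge "$min_age" ] && [ "$age" -le "$max_age" ]
}
```

Calling `age_ok "$(commit_epoch "$DATE")" "$(date -u +%s)" 600 86400` inside an `if` keeps a malformed DATE from aborting the loop.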
…ult)
GitHub Actions runs our script with 'shell: /usr/bin/bash -e {0}', so
errexit is active from the shell invocation regardless of what we put in
the script. 'set -uo pipefail' only adds options; it does not remove -e.
Use 'set +e' before 'set -uo pipefail' so a non-zero exit from a pipe
(grep -q with no match, etc.) in the middle of scanning multiple
commits no longer silently kills the loop.
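
The effect of the option ordering can be illustrated with a small sketch. The `scan_commits` function and its `^ready` pattern are made up for the example; only the two `set` lines mirror what the commit message describes.

```shell
# Actions starts the step with `bash -e`, so errexit is already on here.
set +e            # turn errexit off explicitly
set -uo pipefail  # adds nounset and pipefail; on its own it would leave -e set

# Classify each argument; a stand-in for scanning several commits.
scan_commits() {
  local item rc
  for item in "$@"; do
    # grep -q exits 1 on "no match"; under errexit this unguarded
    # pipeline would terminate the script on the first miss.
    printf '%s\n' "$item" | grep -q '^ready'
    rc=$?
    if [ "$rc" -eq 0 ]; then
      echo "dispatch $item"
    else
      echo "no ready archs for $item"
    fi
  done
}
```

With errexit cleared, `scan_commits old1 ready2 old3` reports all three items; with `-e` still active the script would stop after the first non-matching item.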
Jenkins build for 7e331d97cb23b9ba937aa56d586a886740fd4a99 commit finished as FAILURE
The auto-trigger previously waited for every ROCm check-run on an upstream SHA to complete before dispatching parity.yml, but download_testlogs also consumes CUDA default/distributed shards from trunk and CUDA inductor shards from the inductor workflow. If those CUDA jobs were still running, the parity report could be authored with partial CUDA data. Fetch all check-runs for the SHA, split out ROCm check-runs plus the CUDA test check-runs used by download_testlogs, and require the combined set to be status=completed before dispatching. Conclusions may still be failure; we only need the shards to have finished so their logs/artifacts are available.
Jenkins build for bb046b388cdf6d2fa2f12fe8d0dc785aba3badd5 commit finished as FAILURE
The CUDA readiness gate should wait for the jobs that parity.yml actually consumes, not every upstream check-run containing "rocm". Some unrelated ROCm benchmark/periodic jobs can still be pending on the same SHA and would otherwise block reports unnecessarily. Build the ROCm side of the gate from the configured per-arch test shard regexes, then combine that with CUDA default/distributed/inductor checks. This preserves the "wait until the jobs we compare are finished" invariant without waiting on unrelated ROCm jobs.
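
The "only wait on the jobs we compare" gate can be sketched as a filter over check-run names. This is an illustration under stated assumptions: the function name `gate_ready` is hypothetical, input is assumed to be `name<TAB>status` lines on stdin, and treating "no matching check-runs" as not ready is a choice made for this sketch rather than something the PR states.

```shell
# gate_ready REGEX
# Reads "name<TAB>status" lines on stdin. Returns 0 only if at least one
# check-run name matches REGEX and every matching check-run has
# status "completed" (its conclusion is deliberately ignored).
gate_ready() {
  awk -F'\t' -v re="$1" '
    $1 ~ re { seen++; if ($2 != "completed") pending++ }
    END     { exit !(seen > 0 && pending == 0) }
  '
}
```

The regex would be the per-arch ROCm shard patterns joined with the CUDA default/distributed/inductor patterns using `|`. The input lines could, for instance, come from `gh api repos/pytorch/pytorch/commits/<SHA>/check-runs --paginate --jq '.check_runs[] | [.name, .status] | @tsv'`.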
Jenkins build for 3f0fa62ba5d8141b952cc3af902fd2331f791792 commit finished as NOT_BUILT
Upstream trunk now provides the CUDA default/distributed coverage we need through linux-jammy-cuda13.0-py3.10-gcc11 test-osdc shards rather than the older normal test shards. The old lookup matched test-osdc loosely as '/ test', then failed to find logs/artifacts because it still searched for '/ test (' job names and test-reports-test-default/distributed prefixes.
Switch CUDA default/distributed log matching to test-osdc, use the test-reports-test-osdc-default/distributed artifact prefixes, and normalize extracted test-osdc artifact directories back to test-default/test-distributed so summarize_xml_testreports keeps assigning the existing test_config values. Also update parity-auto's CUDA readiness regex to wait for the same OSDC shards before dispatching.
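
The directory normalization can be pictured with the following shell sketch. The real logic lives in download_testlogs, whose implementation is not shown here, so the function name `normalize_osdc_dir` and the example folder suffixes are purely hypothetical.

```shell
# Map extracted OSDC artifact folder names back to the older naming so
# downstream code that keys on test-default / test-distributed still works.
# Any other folder name is passed through unchanged.
normalize_osdc_dir() {
  local name="$1"
  case "$name" in
    test-osdc-default*|test-osdc-distributed*)
      printf 'test-%s\n' "${name#test-osdc-}" ;;
    *)
      printf '%s\n' "$name" ;;
  esac
}
```

Under these assumptions, a folder named `test-osdc-default-1-5` would be renamed to `test-default-1-5`, while inductor folders are left alone.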
Jenkins build for 4d77e114a308071dd31fc1d665d44e3933d6f0bb commit finished as NOT_BUILT
The auto-trigger is lightweight API polling, and a 30 minute cron leaves too much latency after the last ROCm/CUDA parity shard finishes. Tighten the schedule to every 10 minutes so completed upstream commits are picked up sooner while still avoiding excessive schedule noise.
Jenkins build for 4d77e114a308071dd31fc1d665d44e3933d6f0bb commit finished as FAILURE
Jenkins build for be9768a43660294dcbb1187bc1ab07ff95cedefc commit finished as FAILURE Detected error during Pytorch building:
Summary
Adds `.github/workflows/parity-auto.yml` so ROCm/pytorch automatically dispatches `parity.yml` for upstream `pytorch/pytorch:main` commits once the CI jobs needed for the parity report have finished.
The workflow currently:
* Runs on a cron schedule and via `workflow_dispatch`.
* Requires the ROCm test shards for each configured arch to be `status=completed`.
* Requires the CUDA shards consumed by `download_testlogs` to be `status=completed`:
  * `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, ...)`
  * `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (distributed, ...)`
  * `unit-test / inductor-test / test (inductor, ...)`
* Dispatches `parity.yml` once for the ready, unprocessed arch subset.
* Embeds the full SHA in the `csv_name`/run title so the next scan can avoid duplicates.
The cron is set to every 10 minutes to reduce dispatch latency after upstream CI finishes.
Notable details
* The readiness gate checks `status=completed`, not `conclusion=success`; failing test shards are still useful because they produce logs/artifacts.
* CUDA default/distributed reports are fetched using the `test-reports-test-osdc-*` artifact prefixes.
* `download_testlogs` normalizes extracted CUDA OSDC artifact folders back to `test-default-*`/`test-distributed-*` so the existing XML summarizer keeps producing the same `test_config` values.
Testing on fork
This version has been deployed on `ethanwee1/pytorch:main` for live testing.
Recent successful scheduled auto-trigger runs on the latest fork head `b490444...`:
Recent parity reports dispatched by the auto-trigger after the latest fixes:
Earlier failures on the fork were from older revisions before the CSV field-size and CUDA OSDC fixes. The latest completed reports on the current fork head are green.
Follow-up after merge
After this lands on ROCm/pytorch `develop`, disable the fork cron to avoid duplicate polling/dispatching: