Parity Auto Trigger: run parity.yml per upstream commit once CI finishes by ethanwee1 · Pull Request #3176 · ROCm/pytorch

ethanwee1 · 2026-04-23T14:30:25Z

Summary

Adds .github/workflows/parity-auto.yml so ROCm/pytorch automatically dispatches parity.yml for upstream pytorch/pytorch:main commits once the CI jobs needed for the parity report have finished.

The workflow currently:

- Polls recent upstream commits on a cron and via workflow_dispatch.
- Reads upstream check-runs for each SHA rather than relying on parent workflow status.
- Waits for every in-scope ROCm parity test shard to be status=completed.
- Waits for the CUDA jobs consumed by download_testlogs to be status=completed:
  - linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, ...)
  - linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (distributed, ...)
  - unit-test / inductor-test / test (inductor, ...)
- Dispatches parity.yml once for the ready, unprocessed arch subset.
- Embeds the upstream SHA in csv_name/run title so the next scan can avoid duplicates.

The cron is set to every 10 minutes to reduce dispatch latency after upstream CI finishes.

Notable details

- Readiness is based on check-run status=completed, not conclusion=success; failing test shards are still useful because they produce logs/artifacts.
- ROCm readiness is scoped to the configured arch test-shard regexes, so unrelated ROCm benchmark/periodic jobs do not block parity reports.
- CUDA default/distributed now uses the upstream OSDC CUDA jobs and test-reports-test-osdc-* artifact prefixes.
- download_testlogs normalizes extracted CUDA OSDC artifact folders back to test-default-* / test-distributed-* so the existing XML summarizer keeps producing the same test_config values.
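
The readiness rule in these notes can be sketched in a few lines of bash. This is only an illustration, not the workflow's actual script: the tab-separated input format, the function names `gate_ready` and `sample_runs`, and the ROCm job names are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch of the readiness gate: a check-run counts as finished once its
# status is "completed", whatever its conclusion is.
# Assumption: one check-run per line as "name<TAB>status<TAB>conclusion".

# gate_ready REGEX: reads check-runs on stdin. Succeeds only if at least one
# name matches REGEX and every matching check-run has status=completed.
gate_ready() {
  local regex=$1 name status conclusion matched=0
  while IFS=$'\t' read -r name status conclusion; do
    [[ $name =~ $regex ]] || continue       # out-of-scope jobs never block
    matched=1
    [[ $status == completed ]] || return 1  # still queued or in progress
  done
  (( matched == 1 ))
}

# Hypothetical sample data: two finished shards (one failed) and an
# unrelated benchmark job that is still running.
sample_runs() {
  printf '%s\t%s\t%s\n' \
    'rocm-mi355 / test (default, 1, 2)' completed   failure \
    'rocm-mi355 / test (default, 2, 2)' completed   success \
    'rocm-benchmark / perf'             in_progress ''
}

if sample_runs | gate_ready 'mi355 / test \('; then
  echo "mi355: ready"
else
  echo "mi355: waiting"
fi
```

With this sample data the mi355 shards count as ready even though one of them failed, and the in-progress benchmark job is ignored because it does not match the shard regex.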

Testing on fork

This version has been deployed on ethanwee1/pytorch:main for live testing.

Recent successful scheduled auto-trigger runs on the latest fork head b490444...:

Recent parity reports dispatched by the auto-trigger after the latest fixes:

https://github.com/ethanwee1/pytorch/actions/runs/25165084598 — success
https://github.com/ethanwee1/pytorch/actions/runs/25161924236 — success
https://github.com/ethanwee1/pytorch/actions/runs/25146703765 — success
https://github.com/ethanwee1/pytorch/actions/runs/25171334003 — in progress at time of PR update

Earlier failures on the fork were from older revisions before the CSV field-size and CUDA OSDC fixes. The latest completed reports on the current fork head are green.

Follow-up after merge

After this lands on ROCm/pytorch develop, disable the fork cron to avoid duplicate polling/dispatching:

gh workflow disable parity-auto.yml --repo ethanwee1/pytorch

Adds .github/workflows/parity-auto.yml, which runs on a 30-minute cron (and workflow_dispatch for testing) and:

1. Pulls the most recent commits from pytorch/pytorch:main.
2. Skips commits that are too new (CI not started), too old (back-fill limit), or already have a parity.yml run in this repo (detected by matching the full SHA in prior run titles).
3. For the first remaining commit whose upstream check-runs are all "completed", dispatches parity.yml with that SHA so download_testlogs pulls the artifacts and logs for that exact build and generate_summary.py produces the per-arch report.

csv_name is set to "autoparity-YYYYMMDD-<full SHA>" so the SHA ends up in the dispatched run's display title, which is what this workflow queries to avoid re-dispatching.

Inputs expose max_commits, lookback_hours, max_age_hours, arch, and dry_run for tuning / debugging without code changes.
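
A minimal sketch of the naming and de-duplication idea follows. The helper names `make_csv_name` and `already_dispatched` are hypothetical, the SHA is a placeholder, and the prior run titles are passed in as plain text rather than fetched from the API.

```shell
#!/usr/bin/env bash
# Sketch: embed the full SHA in csv_name, then look for that SHA in the
# titles of earlier runs to decide whether a commit was already handled.

# make_csv_name DATE SHA -> "autoparity-YYYYMMDD-<full SHA>"
make_csv_name() {
  printf 'autoparity-%s-%s\n' "$1" "$2"
}

# already_dispatched SHA: reads prior run titles on stdin (one per line) and
# succeeds if any of them contains the full SHA as a fixed string.
already_dispatched() {
  grep -qF -- "$1"
}

sha=0123456789abcdef0123456789abcdef01234567   # placeholder SHA
titles="parity $(make_csv_name 20260423 "$sha")"

if printf '%s\n' "$titles" | already_dispatched "$sha"; then
  echo "skip $sha: parity run already exists"
else
  echo "dispatch $sha"
fi
```

Matching on the full 40-character SHA rather than a short prefix keeps two commits with similar prefixes from being mistaken for each other.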

rocm-repo-management-api · 2026-04-23T15:01:25Z

Jenkins build for b926457d84f7482481aa8f1fdfced65247ad8882 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

…ow completion

Previously the workflow used a blunt "all upstream check-runs completed" gate and dispatched parity.yml with a fixed arch list (mi355, mi300, mi200). That meant:

* We blocked on hundreds of unrelated upstream check-runs (labeler bots, etc.).
* We'd dispatch with arch="mi355, mi300, mi200" for a commit where only `trunk` had run, so mi300/mi200 had no data and the parity report came out nearly empty.

Per-arch rewrite:

* Query `repos/pytorch/pytorch/actions/runs?head_sha=<SHA>` to see which upstream workflows actually completed on the commit.
* Map each arch to its default-tier upstream workflow (mi355→trunk, mi300→rocm-mi300, mi200→trunk-rocm-sandbox, navi31→rocm-navi31, nightly→rocm-nightly), exposed as `arch_workflow_map` input.
* For each SHA newest→oldest, compute ready archs = archs whose required workflow is completed, minus archs already dispatched for that SHA (parsed from prior parity run titles after " · ").
* If the remaining set is non-empty, dispatch parity.yml with arch=<that subset> and csv_name embedding the full SHA.

Effect: mi355 gets a parity report per upstream commit (trunk runs per-commit). mi300/mi200 get dispatched separately whenever their less-frequent periodic workflow finishes on a given SHA. Each (SHA, arch) pair is dispatched at most once.

Also adds a `target_ref` input so the dispatched parity.yml can run off a specific branch (useful for testing against a branch that has the up-to-date parity scripts while the workflow file itself lives on the default branch).
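
The "ready archs minus already-dispatched archs" step can be sketched as below. The exact run-title layout is an assumption (only the " · " separator comes from the commit message), and `dispatched_archs` and `remaining_archs` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: work out which archs still need a parity run for one SHA.
# Assumed title layout: "<anything containing the SHA> · arch1, arch2".

# dispatched_archs SHA: reads prior run titles on stdin and prints the archs
# already dispatched for SHA, one per line.
dispatched_archs() {
  local sha=$1 title
  while IFS= read -r title; do
    [[ $title == *"$sha"* && $title == *" · "* ]] || continue
    # Keep the text after the last " · ", then split the comma-separated list.
    printf '%s\n' "${title##* · }" | tr -d ' ' | tr ',' '\n'
  done
}

# remaining_archs "READY" "DONE": both arguments are space-separated lists.
# Prints the ready archs that are not in DONE, comma-separated.
remaining_archs() {
  local ready=$1 done_list=" $2 " arch out=()
  for arch in $ready; do
    [[ $done_list == *" $arch "* ]] || out+=("$arch")
  done
  local IFS=,
  printf '%s\n' "${out[*]}"
}

sha=0123456789abcdef0123456789abcdef01234567   # placeholder SHA
titles="autoparity-20260423-$sha · mi355"
done_archs=$(printf '%s\n' "$titles" | dispatched_archs "$sha" | tr '\n' ' ')
remaining_archs "mi355 mi300" "$done_archs"    # prints: mi300
```

An empty result means every ready arch was already dispatched for that SHA, so the scan can move on to the next commit.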

rocm-repo-management-api · 2026-04-23T17:02:20Z

Jenkins build for b926457d84f7482481aa8f1fdfced65247ad8882 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

The loop was silently aborting after printing 'no ready archs' for the first commit, because set -e was catching a non-zero exit in the next iteration (most likely date -u -d failing on an edge-case DATE string, or a gh api pagination call hitting a transient error). Drop -e (we already guard the pipelines that matter with || true), and make COMMIT_EPOCH fall back to 0 + skip the age check if date -d parsing fails.
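
The epoch fallback described in this commit might look roughly like the sketch below. It assumes GNU `date` (the `-d` option), and `commit_epoch` and `too_old` are hypothetical helper names rather than the script's own.

```shell
#!/usr/bin/env bash
# Sketch: parse a commit date defensively so that one bad string cannot
# abort the scan. Assumes GNU date.

# commit_epoch DATE: prints the Unix epoch for DATE, or 0 if parsing fails.
commit_epoch() {
  local epoch
  epoch=$(date -u -d "$1" +%s 2>/dev/null) || epoch=0
  printf '%s\n' "${epoch:-0}"
}

# too_old EPOCH MAX_AGE_HOURS NOW: succeeds if the commit is older than the
# limit. An epoch of 0 means "unknown", so the age check is skipped.
too_old() {
  local epoch=$1 max_hours=$2 now=$3
  (( epoch > 0 && now - epoch > max_hours * 3600 ))
}

now=$(date -u +%s)
epoch=$(commit_epoch "not a date")
if too_old "$epoch" 48 "$now"; then
  echo "skip: too old"
else
  echo "keep: within age limit or date unknown (epoch=$epoch)"
fi
```

Treating an unparseable date as "unknown" rather than "very old" means such a commit is still considered instead of being dropped silently.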

…ult)

GitHub Actions runs our script with 'shell: /usr/bin/bash -e {0}', so errexit is already active from the shell invocation regardless of what we put in the script. 'set -uo pipefail' only adds options; it does not remove -e. Use 'set +e' before 'set -uo pipefail' so a non-zero exit from a pipe (grep -q with no match, etc.) in the middle of scanning multiple commits no longer silently kills the loop.
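
The option behaviour described here can be reproduced in isolation. The sketch below uses placeholder SHAs and titles. It first shows that `set -uo pipefail` leaves an inherited `-e` in place, then runs a scan loop with errexit explicitly cleared.

```shell
#!/usr/bin/env bash
# Sketch: errexit inherited from "bash -e" stays on until it is cleared.

# A child shell started with -e still reports "e" in $- after set -uo pipefail.
bash -e -c 'set -uo pipefail; case $- in *e*) echo "errexit still on" ;; esac'

set +e            # clear errexit first ...
set -uo pipefail  # ... then add nounset and pipefail

titles="autoparity-20260423-bbb"   # placeholder for prior run titles
scanned=0
for sha in aaa bbb ccc; do         # placeholder SHAs
  # grep -q exits 1 when nothing matches. Under errexit this bare command
  # would end the script on the first commit without a prior run.
  grep -q "$sha" <<<"$titles"
  found=$?
  scanned=$((scanned + 1))
  if [ "$found" -eq 0 ]; then
    echo "$sha: already dispatched"
  else
    echo "$sha: new"
  fi
done
echo "scanned $scanned commits"
```

With errexit cleared, the loop visits all three placeholder commits instead of stopping at the first one that has no matching title.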

… dispatching

rocm-repo-management-api · 2026-04-29T16:45:40Z

Jenkins build for 7e331d97cb23b9ba937aa56d586a886740fd4a99 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

The auto-trigger previously waited for every ROCm check-run on an upstream SHA to complete before dispatching parity.yml, but download_testlogs also consumes CUDA default/distributed shards from trunk and CUDA inductor shards from the inductor workflow. If those CUDA jobs were still running, the parity report could be authored with partial CUDA data. Fetch all check-runs for the SHA, split out ROCm check-runs plus the CUDA test check-runs used by download_testlogs, and require the combined set to be status=completed before dispatching. Conclusions may still be failure; we only need the shards to have finished so their logs/artifacts are available.

rocm-repo-management-api · 2026-04-29T18:27:39Z

Jenkins build for bb046b388cdf6d2fa2f12fe8d0dc785aba3badd5 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

The readiness gate should wait for the jobs that parity.yml actually consumes, not every upstream check-run containing "rocm". Some unrelated ROCm benchmark/periodic jobs can still be pending on the same SHA and would otherwise block reports unnecessarily. Build the ROCm side of the gate from the configured per-arch test shard regexes, then combine that with CUDA default/distributed/inductor checks. This preserves the "wait until the jobs we compare are finished" invariant without waiting on unrelated ROCm jobs.

rocm-repo-management-api · 2026-04-29T18:43:08Z

Jenkins build for 3f0fa62ba5d8141b952cc3af902fd2331f791792 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

Upstream trunk now provides the CUDA default/distributed coverage we need through linux-jammy-cuda13.0-py3.10-gcc11 test-osdc shards rather than the older normal test shards. The old lookup matched test-osdc loosely as '/ test', then failed to find logs/artifacts because it still searched for '/ test (' job names and test-reports-test-default/distributed prefixes. Switch CUDA default/distributed log matching to test-osdc, use the test-reports-test-osdc-default/distributed artifact prefixes, and normalize extracted test-osdc artifact directories back to test-default/test-distributed so summarize_xml_testreports keeps assigning the existing test_config values. Also update parity-auto's CUDA readiness regex to wait for the same OSDC shards before dispatching.
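
The directory normalization might be sketched as follows. The concrete directory names are assumptions (the commit message only gives the test-osdc, test-default, and test-distributed prefixes), and `normalize_osdc_name` and `normalize_osdc_dirs` are hypothetical helpers, not functions from download_testlogs.

```shell
#!/usr/bin/env bash
# Sketch: rename extracted OSDC artifact directories so downstream tooling
# sees the older test-default-* / test-distributed-* names.

# normalize_osdc_name NAME: maps test-osdc-default-* to test-default-* and
# test-osdc-distributed-* to test-distributed-*. Other names pass through.
normalize_osdc_name() {
  case $1 in
    test-osdc-default-*|test-osdc-distributed-*)
      printf '%s\n' "test-${1#test-osdc-}" ;;
    *)
      printf '%s\n' "$1" ;;
  esac
}

# normalize_osdc_dirs ROOT: renames matching directories directly under ROOT.
normalize_osdc_dirs() {
  local root=$1 dir new
  for dir in "$root"/test-osdc-*/; do
    [ -d "$dir" ] || continue          # the glob matched nothing
    dir=${dir%/}
    new=$root/$(normalize_osdc_name "${dir##*/}")
    # Skip names that do not change, and never overwrite an existing target.
    [ "$dir" = "$new" ] || [ -e "$new" ] || mv -- "$dir" "$new"
  done
}

normalize_osdc_name "test-osdc-default-1-4"   # prints: test-default-1-4
```

Renaming on disk after extraction leaves the summarizer untouched, which matches the stated goal of keeping the existing test_config values.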

rocm-repo-management-api · 2026-04-29T19:41:06Z

Jenkins build for 4d77e114a308071dd31fc1d665d44e3933d6f0bb commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

The auto-trigger is lightweight API polling, and a 30 minute cron leaves too much latency after the last ROCm/CUDA parity shard finishes. Tighten the schedule to every 10 minutes so completed upstream commits are picked up sooner while still avoiding excessive schedule noise.

rocm-repo-management-api · 2026-04-29T20:32:31Z

Jenkins build for 4d77e114a308071dd31fc1d665d44e3933d6f0bb commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

rocm-repo-management-api · 2026-04-30T15:03:34Z

Jenkins build for be9768a43660294dcbb1187bc1ab07ff95cedefc commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during Pytorch building:

/var/lib/jenkins/pytorch/aten/src/ATen/hip/CublasHandlePool.cpp:60:11: warning: enumeration value ‘rocblas_status_excluded_from_build’ not handled in switch [-Wswitch]
/var/lib/jenkins/pytorch/aten/src/ATen/hip/CublasHandlePool.cpp:60:11: warning: enumeration value ‘rocblas_status_arch_mismatch’ not handled in switch [-Wswitch]
[7625/8176] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/cudnn/hip/BatchNorm.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[7626/8176] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o 
/opt/cache/bin/sccache /opt/cache/bin/c++ -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_POSIX_FALLOCATE=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHIPBLASLT_USE_ROCROLLER -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_VERSION=70202 -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_HIP_VERSION=702 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_LAYERNORM_FAST_RECIPROCAL -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_PROF_API=1 -DUSE_ROCM -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dtorch_hip_EXPORTS -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/var/lib/jenkins/pytorch/nlohmann -I/var/lib/jenkins/pytorch/moodycamel -I/var/lib/jenkins/pytorch/aten/src/THH -I/var/lib/jenkins/pytorch/third_party/mslk/include -I/var/lib/jenkins/pytorch/aten/src/ATen/hip -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/library/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/example/ck_tile/01_fmha -I/var/lib/jenkins/pytorch/build/caffe2/aten/src/ATen/composable_kernel -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/aiter/csrc/include -I/var/lib/jenkins/pytorch/third_party/fmt/include -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/torch/include -I/var/lib/jenkins/pytorch/c10/hip/../.. -I/var/lib/jenkins/pytorch/c10/.. 
-I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/build/third_party/gloo/hip -isystem /opt/rocm-7.2.2/include -isystem /var/lib/jenkins/pytorch/build/third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/protobuf/src -isystem /opt/conda/envs/py_3.12/include -isystem /var/lib/jenkins/pytorch/third_party/XNNPACK/include -isystem /var/lib/jenkins/pytorch/third_party/ittapi/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/eigen -isystem /opt/rocm/include -isystem /var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/pytorch/third_party/ideep/include -isystem /var/lib/jenkins/pytorch/INTERFACE -isystem /var/lib/jenkins/pytorch/third_party/nlohmann/include -isystem /var/lib/jenkins/pytorch/third_party/concurrentqueue -isystem /opt/rocm-7.2.2/include/hiprand -isystem /opt/rocm-7.2.2/include/rocrand -isystem /opt/rocm/magma/include -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_MSLK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference 
-Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -fPIC -fdiagnostics-color=always -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Wall -Wextra -Wdeprecated -Wunused -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wredundant-move -Wno-interference-size -Wno-maybe-uninitialized -fvisibility=hidden -fPIC -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -DHIP_VERSION=7 -Wno-duplicate-decl-specifier -DUSE_MIOPEN -MD -MT caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o -MF caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o.d -o caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o -c /var/lib/jenkins/pytorch/torch/csrc/distributed/c10d/UCCUtils.cpp
sccache: encountered fatal error
sccache: error : corrupt deflate stream
sccache:  cause: corrupt deflate stream
[7627/8176] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/HIPSparseDescriptors.cpp.o

Base automatically changed from parity-summary-improvements to develop April 23, 2026 14:35

ethanwee1 force-pushed the parity-auto-trigger branch from dbe8b5f to 82e48da Compare April 23, 2026 14:38

ethanwee1 marked this pull request as draft April 29, 2026 16:22

ethanwee1 added 4 commits April 29, 2026 16:30

parity-auto: gate dispatch on per-arch test check-runs being complete

946a729

parity-auto: require every ROCm check-run on the SHA to finish before dispatching

7e331d9

parity-auto: backfill ready commits without duplicates

be9768a

ethanwee1 wants to merge 11 commits into develop from parity-auto-trigger
