[CI] Show only ROCm failures in parity summary and add cross-arch column #3153
jithunnair-amd merged 8 commits into develop
Conversation
Only display tests where the ROCm status is FAILED (the CUDA status is shown as a context column). Add a cross-architecture lookup so each failure shows which other architectures have the same test failing.
Jenkins build for fad16aed2f09fcc9366270a28d73d228b5629220 commit finished as NOT_BUILT
Remove cuda, cuda_dist, cuda_inductor, and baseline entries from LOG_FILE_MAP since only ROCm failures are relevant to the parity report.
Jenkins build for fad16aed2f09fcc9366270a28d73d228b5629220 commit finished as FAILURE
- Restore CUDA and baseline log parsing so their failures can be cross-referenced, but keep the LOG-BASED FAILURES table's Arch column limited to ROCm entries (CUDA rows are hidden from the table itself).
- Add an "Also Failing In" column to LOG-BASED FAILURES, and include "cuda" in the FAILED TESTS "Also Failing In" column when a CUDA log failure exists for the same test tuple. This lets us spot tests failing on both platforms so we can revert upstream changes instead of filing a ROCm DISABLED issue.
- Split the single Shard column in FAILED TESTS into Shard (rocm) and Shard (cuda) so each failure can be looked up in either CI job.
- Propagate the active test-file shard to CONSISTENT_FAILURE log entries so shard info is no longer blank in the log-based failures table.
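The cross-arch lookup described above amounts to indexing every failure by its test tuple and collecting the architectures seen for each. A minimal sketch, assuming a simple dict-per-row record shape; the function names (`build_cross_arch_index`, `also_failing_in`) are hypothetical and not the actual generate_summary.py code:

```python
from collections import defaultdict

def build_cross_arch_index(failures):
    """Map each (test_file, test_class, test_name) tuple to the set of
    architectures on which that test is failing."""
    index = defaultdict(set)
    for f in failures:
        index[(f["test_file"], f["test_class"], f["test_name"])].add(f["arch"])
    return index

def also_failing_in(index, arch, key):
    """Render the 'Also Failing In' cell: every *other* arch with the same
    failing test tuple, comma-separated (empty string if none)."""
    return ", ".join(sorted(index.get(key, set()) - {arch}))
```

With this shape, including CUDA rows in the index (while hiding them from the table) is what makes "cuda" appear in the "Also Failing In" column.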
Jenkins build for 902d7cfa0b3c35044138c88548e5991bf5c43049 commit finished as ABORTED
- detect_log_failures.py now computes job-level shard totals by counting log files per (platform, test_config) and emits both job_shard (e.g. 3/6, derived from filename + file count) and test_shard (e.g. 10/15, the intra-file pytest "Running ... N/M" value) for each failure, including CONSISTENT_FAILURE entries.
- The generate_summary.py LOG-BASED FAILURES table now has separate "Job-Level Shard" and "Test-Level Shard" columns so reviewers can jump directly to the CI job and any intra-file shard.
- FAILED TESTS table columns renamed from "Shard (rocm/cuda)" to "Job-Level Shard (rocm/cuda)" for consistency with the log-based table (these values are already derived from the XML report dir name, e.g. test-default-3-6).
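The "filename + file count" derivation of job_shard might look like the following sketch. The directory layout (`<platform>/<test_config>/<name>-<k>-<n>.log`), the regex, and all function names here are assumptions for illustration, not the real detect_log_failures.py:

```python
import re
from collections import Counter
from pathlib import Path

# Assumed filename shape, e.g. 'default-3-6.log' -> shard index 3.
SHARD_RE = re.compile(r"-(\d+)-\d+\.log$")

def job_shard_totals(log_paths):
    """Count log files per (platform, test_config) so the total in 'k/N'
    comes from what was actually downloaded, not just the filename."""
    totals = Counter()
    for p in log_paths:
        platform, test_config = Path(p).parts[-3:-1]
        totals[(platform, test_config)] += 1
    return totals

def job_shard_label(path, totals):
    """Render 'k/N' for one log file: k from the filename, N from the count."""
    platform, test_config = Path(path).parts[-3:-1]
    m = SHARD_RE.search(Path(path).name)
    index = m.group(1) if m else "?"
    return f"{index}/{totals[(platform, test_config)]}"
```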
Jenkins build for 81c66e65c6790db005efda1c3918f359684ffd5b commit finished as FAILURE
When a test failure is already reported in the XML-based FAILED TESTS table, it would also appear in LOG-BASED FAILURES whenever the same shard's log contained a "failed!" or "FAILED CONSISTENTLY" line. That made the summary look like two separate failures when there was only one.

The LOG-BASED section is meant for failures *not* captured by XML (timeouts, crashes, process kills), so skip any entry whose (arch, test_file, test_class, test_name) tuple already appears in the FAILED TESTS table. Also normalize test_file before comparing, since XML uses dotted paths (e.g. distributed.test_symmetric_memory) while logs use slash paths (distributed/test_symmetric_memory, sometimes with a trailing .py).

On run 24735028060 this drops the LOG-BASED section from 21 rows to 6 truly XML-missing timeouts.
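The normalize-then-dedup step can be illustrated as below. This is a sketch with hypothetical helper names (and `str.removesuffix` requires Python 3.9+); the real logic lives in generate_summary.py:

```python
def normalize_test_file(name):
    """Make XML's dotted form and the logs' slash form compare equal:
    'distributed.test_symmetric_memory' and
    'distributed/test_symmetric_memory.py' both become
    'distributed/test_symmetric_memory'."""
    return name.removesuffix(".py").replace(".", "/")

def dedup_log_failures(log_failures, xml_failures):
    """Keep only log-based entries whose (arch, test_file, test_class,
    test_name) tuple is NOT already in the XML FAILED TESTS table."""
    xml_keys = {
        (f["arch"], normalize_test_file(f["test_file"]),
         f["test_class"], f["test_name"])
        for f in xml_failures
    }
    return [
        f for f in log_failures
        if (f["arch"], normalize_test_file(f["test_file"]),
            f["test_class"], f["test_name"]) not in xml_keys
    ]
```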
Jenkins build for 8d063809e99a31079baa98d915540d6f88df8a1b commit finished as NOT_BUILT
…shards inventory

- detect_log_failures.py now emits a sibling log_shards_<arch>.csv alongside log_failures_<arch>.csv, capturing every (platform, test_config, job_shard, test_file) -> observed test-level shards combination seen in the raw CI logs.
- generate_summary.py consumes the inventory to back-fill "Test-Level Shard (rocm)" and "Test-Level Shard (cuda)" columns in the XML-based FAILED TESTS table (XML artifacts don't contain test-level shard metadata, so we recover it by matching the job-level shard + test file to the log inventory). For intra-file-sharded test files (e.g. test_torchinductor_opinfo_properties split into 14 pytest shards), the value is rendered compactly as "1,6,12/14".
- The LOG-BASED FAILURES table already displayed the test-level shard per entry; no change there beyond the existing column.
- parity.yml: exclude log_shards_*.csv from the CSV discovery glob in the summarize step so the new inventory file isn't mistaken for a parity CSV.
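The compact "1,6,12/14" rendering amounts to joining the sorted observed shard numbers over the total. A minimal sketch, with an assumed function name:

```python
def render_test_shards(observed, total):
    """Render observed intra-file pytest shards compactly:
    {12, 1, 6} of 14 -> '1,6,12/14'."""
    return ",".join(str(s) for s in sorted(observed)) + f"/{total}"
```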
Jenkins build for 8d063809e99a31079baa98d915540d6f88df8a1b commit finished as NOT_BUILT
… in LOG-BASED FAILURES

detect_log_failures.py:
- parse_log_file now also returns a flaky_tests list. When CI's "Test succeeded in new process, continuing with the rest of the tests" marker follows an individual-test PASSED line, the corresponding test is recorded as flaky (the preceding normal-process run failed, hence the rerun).
- scan_logs emits these as structured records with platform, test_config, test_file, test_class, test_name, job_shard, and test_shard.
- A sibling flaky_tests_<arch>.csv is written next to log_failures_<arch>.csv, via the generalized _derive_sibling_path().

generate_summary.py:
- load_flaky_tests_as_log_failures() reads the flaky CSV and shapes it like log-failure rows with category='FLAKY'; main() appends these to the log_failures list.
- FLAKY entries are exempted from the XML-vs-log dedup filter in the LOG-BASED FAILURES table, since a rerun-passed signal is orthogonal to any hard failure recorded in XML.
- Cross-arch "Also Failing In" now naturally links matching flaky tests across architectures.

Verified locally on run 24735028060 artifacts: 20 flaky entries for mi200 and 9 for mi355 (exact 1:1 with "Test succeeded in new process" log lines), including tests like test_flex_attention_with_dynamic_max_autotune_graph_partition_cuda and test_template_epilogue_fusion_static_analysis_...use_async_compile_True that the dashboard owner flagged from run 24796654604.
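The marker-based detection described for parse_log_file can be sketched roughly as below. This is a deliberate simplification with a hypothetical function name; the real parser tracks more state (shards, test_config, etc.):

```python
FLAKY_MARKER = "Test succeeded in new process, continuing with the rest of the tests"

def find_flaky_tests(lines):
    """Record a test as flaky when the rerun-succeeded marker follows an
    individual-test PASSED line that carries the test id."""
    flaky, last_passed = [], None
    for line in lines:
        if " PASSED" in line and "::" in line:
            # e.g. 'inductor/test_foo.py::TestFoo::test_bar PASSED [1.2s]'
            last_passed = line.split(" PASSED")[0].split()[-1]
        elif FLAKY_MARKER in line and last_passed is not None:
            flaky.append(last_passed)
            last_passed = None
    return flaky
```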
Jenkins build for 2051d20f0c6d70d2d7b9ca0644c3a2f1a6f5d9ab commit finished as NOT_BUILT
The summarize job picks the first matching *.csv in the per-arch artifact dir, filtering out auxiliary files. Now that detect_log_failures.py also emits a sibling flaky_tests_<arch>.csv, it can be mistakenly picked up as the parity CSV (e.g. when ordering puts it first), causing generate_summary.py to crash with KeyError: 'status_set1'. Add it to the exclusion list alongside log_failures_*.csv and log_shards_*.csv.
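The fix amounts to skipping any CSV whose name matches one of the auxiliary prefixes before treating it as the parity CSV. A sketch under assumed names (the actual exclusion lives in the parity.yml summarize step's glob):

```python
from pathlib import Path

# Auxiliary CSVs that detect_log_failures.py emits next to the parity CSV.
AUX_PREFIXES = ("log_failures_", "log_shards_", "flaky_tests_")

def pick_parity_csv(artifact_dir):
    """Return the first *.csv in the per-arch artifact dir that is not an
    auxiliary file, or None if nothing matches."""
    for p in sorted(Path(artifact_dir).glob("*.csv")):
        if not p.name.startswith(AUX_PREFIXES):
            return p
    return None
```

Without the filter, sorted order would put flaky_tests_*.csv first here, reproducing the KeyError described above.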
Jenkins build for 2051d20f0c6d70d2d7b9ca0644c3a2f1a6f5d9ab commit finished as FAILURE
jithunnair-amd left a comment
These are some awesome improvements! I did cross-check, though, and the flaky-test detection was unable to catch some tests, such as the following in shard 5 for mi355:
2026-04-22T10:26:36.9749161Z inductor/test_max_autotune.py::TestMaxAutotuneAsyncPipelined::test_triton_error_precompilation_and_autotuning E0422 10:25:44.014000 1154211 site-packages/torch/_inductor/select_algorithm.py:3854] [0/0] Runtime error for autotuning triton choices, defaulting to extern kernels.
2026-04-22T10:26:36.9750048Z W0422 10:25:46.007000 1155334 site-packages/torch/_native/cutedsl_utils.py:55] CuTeDSL operators require optional Python packages `nvidia-cutlass-dsl` and `apache-tvm-ffi`; missing optional dependency `nvidia_cutlass_dsl` (importlib.util.find_spec(nvidia_cutlass_dsl) failed)
2026-04-22T10:26:36.9750731Z /var/lib/jenkins/pytorch/test/inductor/test_max_autotune.py:123: FutureWarning: torch.cuda._set_allocator_settings is deprecated. Use torch._C._accelerator_setAllocatorSettings instead.
2026-04-22T10:26:36.9751116Z torch.cuda.memory._set_allocator_settings("expandable_segments:False")
2026-04-22T10:26:36.9751479Z E0422 10:26:02.003000 1154211 site-packages/torch/_inductor/select_algorithm.py:3854] [0/0] Runtime error for autotuning triton choices, defaulting to extern kernels.
2026-04-22T10:26:36.9751791Z PASSED [19.3006s] [100%]
2026-04-22T10:26:36.9751858Z
2026-04-22T10:26:36.9752076Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_max_autotune/inductor.test_max_autotune-93dbff3468a90d16.xml -
2026-04-22T10:26:36.9752409Z ====================== 1 passed, 276 deselected in 19.34s ======================
2026-04-22T10:26:36.9753715Z Got exit code 0
2026-04-22T10:26:36.9753859Z Test succeeded in new process, continuing with the rest of the tests
We can refine the flaky-test detection logic to be more robust.
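One possible refinement, suggested by the log above: the test id and its PASSED verdict land on separate lines there ('PASSED [19.3006s] [100%]' stands alone), so a heuristic that requires both on one line misses the pairing. Tracking the most recently seen `file.py::Class::test` token anywhere in the log survives that layout. A hedged sketch, not the actual fix:

```python
FLAKY_MARKER = "Test succeeded in new process, continuing with the rest of the tests"

def find_flaky_tests_robust(lines):
    """Pair the rerun-succeeded marker with the most recently seen test-id
    token, rather than requiring id and PASSED on the same line."""
    flaky, current = [], None
    for line in lines:
        for token in line.split():
            # A pytest node id looks like 'path/file.py::Class::test_name'.
            if "::" in token and ".py" in token:
                current = token
        if FLAKY_MARKER in line and current is not None:
            flaky.append(current)
            current = None
    return flaky
```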
Summary
Test plan
Latest run https://github.com/ROCm/pytorch/actions/runs/24798004968
Run without this PR on the same commit: https://github.com/ROCm/pytorch/actions/runs/24796654604