ci: stabilize iluvatar runner and test images #625
Merged
voltjia merged 3 commits on May 29, 2026
bitzyz approved these changes on May 29, 2026
voltjia approved these changes on May 29, 2026
Summary
- Reserve one device for the `iluvatar_gpu` CI job in `.github/ci_config.yml`.
- Lower test parallelism from `pytest -n 8` to `pytest -n 4` and extend the job timeout from `3600` seconds to `7200` seconds.
- Pin the reusable CI workflow to `InfiniTensor/ci` revision `b45d360c8cc529747ee31c5451d7eac96ac9f309`, which reuses unchanged local test images with content-based tags, supports static GPU IDs when probing is unavailable, and falls back from unwritable host lock directories.
- Set `gpu_ids: "0"` so the CI runner takes a deterministic file-lock lease without relying on `ixsmi` host probing.
Motivation
Recent Iluvatar checks on the upstream PR stack failed with full-regression timeouts and `exit 137` kills. The Iluvatar job was configured with `ngpus: 0`, so the CI agent did not take a device lease for the job, while the test stage still ran the full Iluvatar suite with high parallelism.
This PR makes the job reserve one Iluvatar device, lowers local memory pressure, and pulls in the reusable CI workflow update from #626 so unchanged test images can be reused instead of rebuilt.
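The image reuse mentioned above depends on content-based tags: if the build inputs are byte-for-byte unchanged, the tag is unchanged and the existing image can be reused. A minimal sketch of that idea follows. The helper name `content_tag`, the `test-` prefix, the 12-character truncation, and the choice of SHA-256 are illustrative assumptions, not taken from `InfiniTensor/ci`.

```python
import hashlib
from pathlib import Path


def content_tag(paths, prefix="test-"):
    """Derive a deterministic image tag from the contents of build inputs.

    Hypothetical helper: the real workflow may hash different inputs
    or use a different tag format.
    """
    digest = hashlib.sha256()
    # Sort so the tag does not depend on the order the inputs are listed in.
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return prefix + digest.hexdigest()[:12]
```

A runner could compare this tag against the tags of locally available images and skip the rebuild on a match.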
The first #625 hardware run exposed that `ixsmi` probing on the Iluvatar runner can return no devices before the container starts. The second run exposed a root-owned `/tmp/infinitensor-ci-resource-locks` directory on the NVIDIA runner. The third run showed the Iluvatar shadow job correctly waiting on the static device lock but timing out after the old 600-second queue window. A later NVIDIA run showed that the same 600-second queue window is too short when all detected GPUs are already busy.
The follow-up CI pin lets explicit `gpu_ids` use host file locks directly, falls back to a user-writable lock directory when the default is unavailable, and lets Iluvatar set `CUDA_VISIBLE_DEVICES=0` without depending on auto-probing. The Iluvatar queue window now matches the 7200-second test timeout so the shadow job can wait for the main job. Other 60-minute platforms now use a 3600-second queue window to avoid false failures under normal runner contention.
Closes N/A
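The lease behaviour described in the Motivation (a static GPU ID, a host file lock, a fallback lock directory, and a queue window) can be sketched roughly as below. This is an illustration under assumptions, not the `InfiniTensor/ci` implementation: the function names, the lock file naming, the per-user fallback location, and the polling loop are all hypothetical, and the sketch assumes a POSIX host where `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile
import time


def writable_lock_dir(preferred, fallback=None):
    """Return `preferred` if it is usable, otherwise a user-writable fallback."""
    if fallback is None:
        # Assumption: a per-user directory under the system temp dir.
        fallback = os.path.join(tempfile.gettempdir(), f"ci-locks-{os.getuid()}")
    for candidate in (preferred, fallback):
        try:
            os.makedirs(candidate, exist_ok=True)
        except OSError:
            continue  # e.g. the parent is not writable or is not a directory
        if os.access(candidate, os.W_OK | os.X_OK):
            return candidate
    raise PermissionError("no writable lock directory available")


def acquire_gpu_lease(gpu_id, lock_dir, queue_timeout=7200.0, poll=1.0):
    """Wait for an exclusive lock on `gpu_id` until the queue window expires.

    Returns the open file descriptor; the lease is held until it is closed.
    """
    path = os.path.join(lock_dir, f"gpu-{gpu_id}.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    deadline = time.monotonic() + queue_timeout
    while True:
        try:
            # Non-blocking attempt so the queue window can be enforced here.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"GPU {gpu_id} still busy after {queue_timeout}s")
            time.sleep(poll)
```

With a static ID, a job would call `acquire_gpu_lease("0", writable_lock_dir("/tmp/infinitensor-ci-resource-locks"))` and then export `CUDA_VISIBLE_DEVICES=0` for the test process, without probing the host for devices. A second job asking for the same ID waits in the loop, which is why the queue window has to be at least as long as the job it may be waiting behind.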
Type of Change
- `feat`: new feature / new operator / new platform
- `fix`: bug fix
- `perf`: performance improvement (no behavioral change)
- `refactor`: code restructuring without behavior change
- `test`: adding or fixing tests only
- `docs`: documentation only
- `build` / `ci`: build system or CI configuration
- `chore`: tooling, formatting, or other non-code changes
- Breaking change (`!` in the Conventional Commits prefix or a `BREAKING CHANGE:` footer)
Platforms Affected
- `WITH_CPU`
- `WITH_NVIDIA`
- `WITH_ILUVATAR`
- `WITH_METAX`
- `WITH_CAMBRICON`
- `WITH_MOORE`
- `WITH_ASCEND`
- `WITH_TORCH`
Test Results on Supported Platforms
| Platform | `pytest` Result |
| --- | --- |
| NVIDIA | `ci / unit / nvidia` and `ci-v2-shadow / ci-v2-shadow / nvidia` succeeded; queue timeout is now 3600 seconds. |
| Iluvatar | `ci / unit / iluvatar` and `ci-v2-shadow / ci-v2-shadow / iluvatar` succeeded; local matrix confirms `gpu_ids=0`, `ngpus=1`, `timeout_minutes=120`, `queue_timeout=7200`, and `pytest -n 4`. |
| MetaX | `ci / unit / metax` and `ci-v2-shadow / ci-v2-shadow / metax` succeeded. |
| Cambricon | `ci / unit / cambricon` and `ci-v2-shadow / ci-v2-shadow / cambricon` succeeded. |
| Moore | `ci / unit / moore` and `ci-v2-shadow / ci-v2-shadow / moore` succeeded. |
| Ascend | `ci / unit / ascend` and `ci-v2-shadow / ci-v2-shadow / ascend` succeeded. |
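The Iluvatar values reported in the test results are mutually consistent: 120 minutes equals 7200 seconds, and one GPU ID matches `ngpus=1`. A small sketch of that check is shown here. The dict layout and the `check_entry` helper are hypothetical and are not the output format of `.ci/config_to_matrix.py`; only the key names and values come from the results above.

```python
def check_entry(entry):
    """Return a list of problems found in one (hypothetical) matrix entry."""
    problems = []
    # The queue window is expected to match the test timeout.
    if entry["queue_timeout"] != entry["timeout_minutes"] * 60:
        problems.append("queue_timeout does not match the test timeout")
    # Each reserved GPU should have exactly one static ID.
    gpu_ids = [g for g in str(entry["gpu_ids"]).split(",") if g]
    if len(gpu_ids) != entry["ngpus"]:
        problems.append("gpu_ids count does not match ngpus")
    return problems


iluvatar = {"gpu_ids": "0", "ngpus": 1, "timeout_minutes": 120, "queue_timeout": 7200}
print(check_entry(iluvatar))  # prints []
```

The old 600-second queue window combined with a 120-minute job would be flagged by the first rule.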
Benchmark / Performance Impact
N/A
Notes for Reviewers
- […] `ixsmi` probing is unavailable.
- […] `8a56c691760d10e1df8e86aafc2a7de6997ba0a0`.
Checklist
Title, Branch, and Commits
- […] `feat(nvidia): …`, `fix(cuda/gemm): …`).
- […] `<type>/xxx-yyyy-zzzz` where `<type>` matches the PR title's Conventional Commits type and words are joined with hyphens (see `CONTRIBUTING.md` §Branches).
- […] `CONTRIBUTING.md` §Pull Requests).
- […] `master`: the branch is rebased cleanly on top of the current `master`.
- […] `fixup!` / `squash!` / `wip` commits remain.
Scope and Design
- […] `CONTRIBUTING.md` §Code/General).
- […] `printf` / `std::cout` / `print(...)` left behind, or `TODO` without an owner and issue link.
General Code Hygiene (applies to all languages)
- […] `CONTRIBUTING.md` §Code/General).
- […] `CONTRIBUTING.md` §Code/General).
C++ Specific (if C++ files changed)
Python Specific (if Python files changed)
Testing
- […] `pytest` must run in CI; this PR changes the CI runner configuration itself.
- […] `tests/`. No operator or runtime functionality was added.
- […] `pytest.mark.auto_act_and_assert`. No tests were added.
- […] `dtype` / `device` parameterization. No tests were added.
Build, CI, and Tooling
- […] `compile_commands.json`. This PR does not change CMake configuration.
- […] `.ci/config_to_matrix.py`.
- […] `.ci`.
Documentation
Security and Safety