Consolidate CI test jobs: merge GPU smoke test and add Python version matrix by ivanbasov · Pull Request #8 · NVIDIA/Ising-Decoding

ivanbasov · 2026-03-05T20:17:39Z

Summary

Consolidates CI from 11 jobs down to 8, with more jobs that actually run tests and a cleaner set of checks on PRs.

Changes

- Merged smoke-test-gpu into gpu-tests: The smoke job previously waited for gpu-tests to finish (needs: gpu-tests) before starting its own setup cycle, adding serial overhead. Both now run in a single job.
- Replaced python-compat (6 jobs, SKIP_TESTS=1) with jobs that actually run tests:
  - unit-tests (3 jobs, CPU): Matrix over Python 3.11/3.12/3.13. Installs inference deps and runs the full test suite with pre-trained models. GPU-specific tests auto-skip.
  - gpu-tests (3 jobs, GPU): Matrix over Python 3.11/3.12/3.13. Installs train deps via deadsnakes PPA, runs full test suite (CPU + GPU), then smoke training + inference.
- Split GPU tests into ci-gpu.yml: NVIDIA self-hosted runners block pull_request events entirely, which caused GPU matrix jobs to show as a single confusing "Skipped" entry on PRs. A separate workflow keeps PR checks clean.
- Added pull_request trigger to ci.yml: CPU jobs now run on every PR targeting main (previously they only ran on push to main).
- GPU CI on PRs via copy-pr-bot: ci-gpu.yml triggers on push to pull-request/[0-9]+ branches created by copy-pr-bot.
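
As a rough illustration of the split described above, the triggers and Python matrix of ci-gpu.yml might look like the sketch below. The trigger list and the ubuntu:22.04 container follow the commit messages in this PR; the workflow name, the runner label, and the `fail-fast` setting are assumptions for illustration, not the merged file.

```yaml
# Sketch of ci-gpu.yml triggers and matrix (assumed layout, not the merged file).
name: GPU CI  # hypothetical workflow name

on:
  push:
    branches:
      - main
      - "pull-request/[0-9]+"  # branches created by copy-pr-bot
  merge_group:
  workflow_dispatch:

jobs:
  gpu-tests:
    name: gpu / py${{ matrix.python-version }}
    runs-on: gpu-runner-label-placeholder  # hypothetical; substitute the real self-hosted label
    container: ubuntu:22.04
    strategy:
      fail-fast: false  # assumption: let each Python version report independently
      matrix:
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      # Remaining steps: install train deps, run the full test suite,
      # then smoke training + inference in this same job.
```

Keeping the GPU jobs in their own workflow means a PR never lists them as skipped: the workflow simply does not trigger on pull_request.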

Before (11 jobs, 1 workflow)

| Job | Runner | Notes |
|---|---|---|
| spdx-header-check | CPU | |
| unit-tests | CPU | py3.12 only |
| unit-tests-coverage | CPU | py3.12 only |
| compat / py3.{11,12,13} / inference | CPU | `SKIP_TESTS=1`; only checked imports |
| compat / py3.{11,12,13} / train | CPU | `SKIP_TESTS=1`; only checked imports |
| gpu-tests | GPU | single Python version |
| smoke-test-gpu | GPU | serial after gpu-tests (`needs:`) |

After (8 jobs, 2 workflows)

ci.yml (runs on pull_request, push to main, merge_group):

| Job | Runner | Notes |
|---|---|---|
| spdx-header-check | CPU | |
| unit-tests / py3.{11,12,13} | CPU | runs real tests with pre-trained models |
| unit-tests-coverage | CPU | py3.12, generates coverage report |

ci-gpu.yml (runs on push to main / pull-request/[0-9]+, merge_group):

| Job | Runner | Notes |
|---|---|---|
| gpu / py3.{11,12,13} | GPU | full test suite + smoke train/inference, all parallel |

Test plan

- YAML validated locally
- CPU jobs (unit-tests, coverage, spdx) pass on PR
- GPU jobs pass on push to main (requires merge; deadsnakes + DEBIAN_FRONTEND=noninteractive tested in earlier runs)

… matrix

- Remove separate smoke-test-gpu job (was serial after gpu-tests, increasing pipeline time). Smoke training+inference now runs in the same gpu-tests job.
- Replace python-compat matrix (6 jobs, SKIP_TESTS=1) with two focused job groups that actually run tests:
  * gpu-tests: matrix over Python 3.11/3.12/3.13 on GPU runners; installs train deps, runs full test suite (CPU+GPU), then smoke training+inference.
  * inference-tests: matrix over Python 3.11/3.12/3.13 on CPU; installs inference deps, runs tests with pre-trained models (GPU tests auto-skip).

Reduces total jobs from 11 to 9 while increasing actual test coverage.

Made-with: Cursor

The deadsnakes PPA pulls in tzdata as a dependency, which triggers an interactive timezone configuration prompt in the container. This caused all 3 GPU matrix jobs to hang for 45 minutes until timeout. Made-with: Cursor
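
A minimal sketch of how such a step could avoid the tzdata prompt is shown below. It assumes the job runs as root inside the ubuntu:22.04 container (so no `sudo`) and that the matrix exposes a `python-version` key; the step name and the exact package list are illustrative, not taken from the merged workflow.

```yaml
# Fragment of a job's step list (assumed shape, not the merged workflow).
steps:
  - name: Install Python from deadsnakes  # hypothetical step name
    env:
      # Stops apt from opening the interactive tzdata timezone dialog.
      DEBIAN_FRONTEND: noninteractive
    run: |
      apt-get update
      apt-get install -y software-properties-common
      add-apt-repository -y ppa:deadsnakes/ppa
      apt-get update
      apt-get install -y \
        python${{ matrix.python-version }} \
        python${{ matrix.python-version }}-venv
```

Setting the variable through `env:` on the step keeps it scoped to the apt commands rather than the whole job.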

Without the pull_request trigger, CI never fires on PRs — checks aren't even planned (e.g. PR #9 shows zero checks). GPU jobs are gated to push/merge_group events to avoid consuming self-hosted GPU runners on every PR update. Made-with: Cursor

GPU jobs complete in ~5-10 minutes and serve as a useful pre-merge check. Made-with: Cursor

The copy-pr-bot creates pull-request/N branches for each PR, which matched the push trigger and caused every CI job to run twice (once from pull_request, once from push). The pull_request trigger already covers PRs targeting main, so the push pattern is redundant. Made-with: Cursor

NVIDIA self-hosted runners block pull_request events outright. GPU CI must run via push events, either to main or to pull-request/[0-9]+ branches created by copy-pr-bot for PR testing.

- Restore "pull-request/[0-9]+" in push trigger
- Gate gpu-tests with if: github.event_name != 'pull_request'
- CPU jobs (inference-tests, unit-tests, etc.) still run on pull_request

Made-with: Cursor
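
The gating described in this commit could be sketched roughly as follows, for the single-workflow layout that existed at this point in the PR. The `if:` expression and the branch pattern come from the commit message; the runner label and the placeholder step are assumptions.

```yaml
# Sketch of the intermediate single-workflow state (not the final layout).
on:
  pull_request:
    branches: [main]
  push:
    branches:
      - main
      - "pull-request/[0-9]+"  # copy-pr-bot branches
  merge_group:

jobs:
  gpu-tests:
    # Skip on pull_request, since the self-hosted GPU runners reject that event.
    if: github.event_name != 'pull_request'
    runs-on: gpu-runner-label-placeholder  # hypothetical label
    steps:
      - run: echo "GPU tests would run here"  # placeholder step
```

A job skipped by `if:` still appears in the PR checks list, which is the "Skipped" noise that a later commit removes by moving the GPU jobs to their own workflow.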

…jobs Simplify triggers: all jobs (including GPU) run on pull_request, push to main, and merge_group. The pull-request/[0-9]+ branch convention is not used by contributors. Made-with: Cursor

- Combine unit-tests (py3.12) and inference-tests (py3.11/3.12/3.13) into a single unit-tests matrix job across all three Python versions. Both ran identical test suites with inference requirements.
- Re-add if: github.event_name != 'pull_request' on gpu-tests since NVIDIA self-hosted runners block pull_request events entirely. GPU CI runs on push to main and merge_group.

Made-with: Cursor
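
For illustration, a merged CPU matrix job along these lines might look like the sketch below. It assumes the GitHub-hosted ubuntu-latest runner that the CPU jobs used at this stage; the requirements file name and the use of pytest as the test runner are hypothetical, since the PR text does not name them.

```yaml
# Sketch of a combined unit-tests matrix job (names below are assumptions).
jobs:
  unit-tests:
    name: unit-tests / py${{ matrix.python-version }}
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install inference dependencies
        # Hypothetical file name; use the repository's actual inference requirements.
        run: pip install -r requirements-inference.txt
      - name: Run tests
        # Assumes pytest; GPU-specific tests are expected to auto-skip on CPU.
        run: python -m pytest
```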

NVIDIA self-hosted runners block pull_request events, so GPU jobs in the main CI workflow always showed as a single "Skipped" entry with unresolved matrix names on every PR. Move GPU jobs to ci-gpu.yml (triggers: push to main, merge_group, workflow_dispatch). The main ci.yml keeps CPU jobs only (triggers: pull_request, push to main, merge_group, workflow_dispatch). Made-with: Cursor

Add pull-request/[0-9]+ to ci-gpu.yml push trigger so GPU tests run when copy-pr-bot creates the corresponding branch for a PR. Made-with: Cursor

The container default shell is sh, which doesn't have the source builtin. Explicitly set shell: bash for the venv activation step. Made-with: Cursor
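
A minimal sketch of that fix, assuming a virtual environment at `.venv` (a hypothetical path) and leaving the actual training commands as a placeholder:

```yaml
# Fragment of a job's step list (assumed shape, not the merged workflow).
steps:
  - name: Smoke training and inference  # hypothetical step name
    shell: bash  # `source` is a bash builtin; plain sh does not provide it
    run: |
      source .venv/bin/activate  # hypothetical venv path
      python --version           # placeholder for the smoke train/inference commands
```

An alternative that works under plain sh is the POSIX form `. .venv/bin/activate`, but setting `shell: bash` keeps the script portable if other bash features are added later.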

The smoke training step uses torch.compile which invokes the inductor backend, requiring a C compiler. The ubuntu:22.04 container doesn't ship with gcc. Made-with: Cursor
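
One way to express this is an extra setup step before the smoke run, sketched below. It assumes the job runs as root in the ubuntu:22.04 container; the step name is illustrative.

```yaml
# Fragment of a job's step list (assumed shape, not the merged workflow).
steps:
  - name: Install C compiler for torch.compile  # hypothetical step name
    env:
      DEBIAN_FRONTEND: noninteractive
    run: |
      apt-get update
      apt-get install -y gcc
```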

bmhowe23

This LGTM, but be advised we can also use nv-cpu-general for CPU runner jobs. In some sense, this is preferable because it uses Nvidia CPU runners.

bmhowe23

Oh, I meant to approve, so resubmitting now.

Use nv-cpu-general runner group instead of GitHub-hosted ubuntu-latest. Also restore pull-request/[0-9]+ push trigger in case self-hosted CPU runners block pull_request events (same as GPU runners). Made-with: Cursor

NVIDIA self-hosted runners block pull_request events. All CI (CPU and GPU) now runs via copy-pr-bot push to pull-request/[0-9]+ branches. Made-with: Cursor
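
Under that final arrangement, the CPU workflow's triggers and runner selection might look roughly like this sketch. The `linux-amd64-cpu4` label is taken from the commit title; whether the runner is selected by that label or through a `group:` key is an assumption, and any other triggers the merged file keeps are omitted here.

```yaml
# Sketch of the final ci.yml trigger and runner selection (assumed, not the merged file).
on:
  push:
    branches:
      - main
      - "pull-request/[0-9]+"  # copy-pr-bot mirrors each PR to such a branch

jobs:
  unit-tests:
    runs-on: linux-amd64-cpu4  # label from the commit title; exact selector is an assumption
    steps:
      - uses: actions/checkout@v4
```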

… matrix (#8)

* Consolidate CI test jobs: merge GPU smoke test and add Python version matrix

  - Remove separate smoke-test-gpu job (was serial after gpu-tests, increasing pipeline time). Smoke training+inference now runs in the same gpu-tests job.
  - Replace python-compat matrix (6 jobs, SKIP_TESTS=1) with two focused job groups that actually run tests:
    * gpu-tests: matrix over Python 3.11/3.12/3.13 on GPU runners; installs train deps, runs full test suite (CPU+GPU), then smoke training+inference.
    * inference-tests: matrix over Python 3.11/3.12/3.13 on CPU; installs inference deps, runs tests with pre-trained models (GPU tests auto-skip).

  Reduces total jobs from 11 to 9 while increasing actual test coverage.

  Made-with: Cursor

* Fix GPU CI: set DEBIAN_FRONTEND=noninteractive to prevent tzdata hang

  The deadsnakes PPA pulls in tzdata as a dependency, which triggers an interactive timezone configuration prompt in the container. This caused all 3 GPU matrix jobs to hang for 45 minutes until timeout.

  Made-with: Cursor

* Add pull_request trigger and gate GPU jobs to push/merge_group only

  Without the pull_request trigger, CI never fires on PRs: checks aren't even planned (e.g. PR #9 shows zero checks). GPU jobs are gated to push/merge_group events to avoid consuming self-hosted GPU runners on every PR update.

  Made-with: Cursor

* Remove event gate on GPU jobs so they run on PRs too

  GPU jobs complete in ~5-10 minutes and serve as a useful pre-merge check.

  Made-with: Cursor

* Remove pull-request/[0-9]+ from push trigger to fix duplicate CI runs

  The copy-pr-bot creates pull-request/N branches for each PR, which matched the push trigger and caused every CI job to run twice (once from pull_request, once from push). The pull_request trigger already covers PRs targeting main, so the push pattern is redundant.

  Made-with: Cursor

* Fix GPU CI: gate on event type, restore push trigger for copy-pr-bot

  NVIDIA self-hosted runners block pull_request events outright. GPU CI must run via push events, either to main or to pull-request/[0-9]+ branches created by copy-pr-bot for PR testing.

  - Restore "pull-request/[0-9]+" in push trigger
  - Gate gpu-tests with if: github.event_name != 'pull_request'
  - CPU jobs (inference-tests, unit-tests, etc.) still run on pull_request

  Made-with: Cursor

* Remove pull-request/[0-9]+ push pattern and pull_request gate on GPU jobs

  Simplify triggers: all jobs (including GPU) run on pull_request, push to main, and merge_group. The pull-request/[0-9]+ branch convention is not used by contributors.

  Made-with: Cursor

* Merge unit-tests + inference-tests, gate GPU jobs from pull_request

  - Combine unit-tests (py3.12) and inference-tests (py3.11/3.12/3.13) into a single unit-tests matrix job across all three Python versions. Both ran identical test suites with inference requirements.
  - Re-add if: github.event_name != 'pull_request' on gpu-tests since NVIDIA self-hosted runners block pull_request events entirely. GPU CI runs on push to main and merge_group.

  Made-with: Cursor

* Split GPU tests into separate workflow to avoid skipped PR noise

  NVIDIA self-hosted runners block pull_request events, so GPU jobs in the main CI workflow always showed as a single "Skipped" entry with unresolved matrix names on every PR. Move GPU jobs to ci-gpu.yml (triggers: push to main, merge_group, workflow_dispatch). The main ci.yml keeps CPU jobs only (triggers: pull_request, push to main, merge_group, workflow_dispatch).

  Made-with: Cursor

* Enable GPU CI on PRs via copy-pr-bot push trigger

  Add pull-request/[0-9]+ to ci-gpu.yml push trigger so GPU tests run when copy-pr-bot creates the corresponding branch for a PR.

  Made-with: Cursor

* Fix smoke test step: use bash shell for source command

  The container default shell is sh, which doesn't have the source builtin. Explicitly set shell: bash for the venv activation step.

  Made-with: Cursor

* Install gcc in GPU container for torch.compile/inductor

  The smoke training step uses torch.compile which invokes the inductor backend, requiring a C compiler. The ubuntu:22.04 container doesn't ship with gcc.

  Made-with: Cursor

* Switch CPU jobs to NVIDIA self-hosted linux-amd64-cpu4 runners

  Use nv-cpu-general runner group instead of GitHub-hosted ubuntu-latest. Also restore pull-request/[0-9]+ push trigger in case self-hosted CPU runners block pull_request events (same as GPU runners).

  Made-with: Cursor

* Remove pull_request trigger since all runners are NVIDIA self-hosted

  NVIDIA self-hosted runners block pull_request events. All CI (CPU and GPU) now runs via copy-pr-bot push to pull-request/[0-9]+ branches.

  Made-with: Cursor

ivanbasov requested review from bmhowe23 March 5, 2026 22:15

ivanbasov added 10 commits March 5, 2026 14:37

Remove event gate on GPU jobs so they run on PRs too

6a05292

Remove pull-request/[0-9]+ push pattern and pull_request gate on GPU …

d35b78c

Enable GPU CI on PRs via copy-pr-bot push trigger

932a3e8

ivanbasov force-pushed the refactor/consolidate-ci-test-jobs branch from f8d312e to 932a3e8 Compare March 5, 2026 22:37

ivanbasov added 2 commits March 5, 2026 15:08

Fix smoke test step: use bash shell for source command

c8589c8

Install gcc in GPU container for torch.compile/inductor

f33d521

bmhowe23 reviewed Mar 5, 2026

bmhowe23 approved these changes Mar 5, 2026

ivanbasov added 2 commits March 5, 2026 17:15

Switch CPU jobs to NVIDIA self-hosted linux-amd64-cpu4 runners

d0604a9

Remove pull_request trigger since all runners are NVIDIA self-hosted

feaed66

ivanbasov merged commit 5c392b7 into NVIDIA:main Mar 6, 2026
8 checks passed

ivanbasov deleted the refactor/consolidate-ci-test-jobs branch March 6, 2026 01:28

ivanbasov merged 14 commits into NVIDIA:main from ivanbasov:refactor/consolidate-ci-test-jobs
