
Consolidate CI test jobs: merge GPU smoke test and add Python version matrix #8

Merged: ivanbasov merged 14 commits into NVIDIA:main from ivanbasov:refactor/consolidate-ci-test-jobs on Mar 6, 2026

Collaborator

@ivanbasov ivanbasov commented Mar 5, 2026

Summary

Consolidates CI from 11 jobs down to 8, with better actual test coverage and a cleaner PR experience.

Changes

  • Merged smoke-test-gpu into gpu-tests: The smoke job previously waited for gpu-tests to finish (needs: gpu-tests) before starting its own setup cycle, adding serial overhead. Both now run in a single job.
  • Replaced python-compat (6 jobs, SKIP_TESTS=1) with jobs that actually run tests:
    • unit-tests (3 jobs, CPU): Matrix over Python 3.11/3.12/3.13. Installs inference deps and runs the full test suite with pre-trained models. GPU-specific tests auto-skip.
    • gpu-tests (3 jobs, GPU): Matrix over Python 3.11/3.12/3.13. Installs train deps via deadsnakes PPA, runs full test suite (CPU + GPU), then smoke training + inference.
  • Split GPU tests into ci-gpu.yml: NVIDIA self-hosted runners block pull_request events entirely, which caused GPU matrix jobs to show as a single confusing "Skipped" entry on PRs. Separate workflow keeps PR checks clean.
  • Added pull_request trigger to ci.yml: CPU jobs now run on every PR targeting main (previously only ran on push to main).
  • GPU CI on PRs via copy-pr-bot: ci-gpu.yml triggers on push to pull-request/[0-9]+ branches created by copy-pr-bot.
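
As a rough illustration of the Python version matrix described above, the CPU job could be laid out along these lines. This is a minimal sketch, not the workflow from this PR: the requirements file name and the use of pytest are assumptions, and the action versions are only examples.

```yaml
# Hypothetical excerpt of ci.yml; step contents are assumptions.
jobs:
  unit-tests:
    name: unit-tests / py${{ matrix.python-version }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let every Python version report its own result
      matrix:
        python-version: ["3.11", "3.12", "3.13"]   # quoted so YAML keeps them as strings
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install inference dependencies
        run: pip install -r requirements-inference.txt   # hypothetical file name
      - name: Run the test suite
        run: pytest                                       # assumes a pytest-based suite
```

One matrix definition expands into three jobs, which is how the three `unit-tests` entries in the job count arise.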

Before (11 jobs, 1 workflow)

| Job | Runner | Notes |
| --- | --- | --- |
| spdx-header-check | CPU | |
| unit-tests | CPU | py3.12 only |
| unit-tests-coverage | CPU | py3.12 only |
| compat / py3.{11,12,13} / inference | CPU | SKIP_TESTS=1, only checked imports |
| compat / py3.{11,12,13} / train | CPU | SKIP_TESTS=1, only checked imports |
| gpu-tests | GPU | single Python version |
| smoke-test-gpu | GPU | serial after gpu-tests (needs:) |

After (8 jobs, 2 workflows)

ci.yml (runs on pull_request, push to main, merge_group):

| Job | Runner | Notes |
| --- | --- | --- |
| spdx-header-check | CPU | |
| unit-tests / py3.{11,12,13} | CPU | runs real tests with pre-trained models |
| unit-tests-coverage | CPU | py3.12, generates coverage report |

ci-gpu.yml (runs on push to main / pull-request/[0-9]+, merge_group):

| Job | Runner | Notes |
| --- | --- | --- |
| gpu / py3.{11,12,13} | GPU | full test suite + smoke train/inference, all parallel |
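
The trigger set listed for ci-gpu.yml could be expressed roughly as follows. This is a sketch assembled from the description and commit messages in this PR, not a copy of the actual file.

```yaml
# Sketch of the ci-gpu.yml trigger section
on:
  push:
    branches:
      - main
      - "pull-request/[0-9]+"   # branches created by copy-pr-bot, one per PR
  merge_group:
  workflow_dispatch:            # manual runs, as mentioned in the commit messages
```

Because there is no `pull_request` trigger in this file, the GPU matrix never appears as a "Skipped" check on PRs; it only shows up when copy-pr-bot pushes the mirrored branch.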

Test plan

  • YAML validated locally
  • CPU jobs (unit-tests, coverage, spdx) pass on PR
  • GPU jobs pass on push to main (requires merge; deadsnakes + DEBIAN_FRONTEND=noninteractive tested in earlier runs)

@ivanbasov ivanbasov requested review from bmhowe23 March 5, 2026 22:15
ivanbasov added 10 commits March 5, 2026 14:37
… matrix

- Remove separate smoke-test-gpu job (was serial after gpu-tests, increasing
  pipeline time). Smoke training+inference now runs in the same gpu-tests job.
- Replace python-compat matrix (6 jobs, SKIP_TESTS=1) with two focused job
  groups that actually run tests:
  * gpu-tests: matrix over Python 3.11/3.12/3.13 on GPU runners — installs
    train deps, runs full test suite (CPU+GPU), then smoke training+inference.
  * inference-tests: matrix over Python 3.11/3.12/3.13 on CPU — installs
    inference deps, runs tests with pre-trained models (GPU tests auto-skip).

Reduces total jobs from 11 to 9 while increasing actual test coverage.

Made-with: Cursor
The deadsnakes PPA pulls in tzdata as a dependency, which triggers an
interactive timezone configuration prompt in the container. This caused
all 3 GPU matrix jobs to hang for 45 minutes until timeout.

Made-with: Cursor
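
A step that avoids the tzdata prompt might look like the sketch below. The step name and the exact package list are assumptions; only `DEBIAN_FRONTEND=noninteractive` and the deadsnakes PPA come from the commit message.

```yaml
# Hypothetical step in the GPU job (runs as root inside the container)
steps:
  - name: Install Python from deadsnakes
    env:
      DEBIAN_FRONTEND: noninteractive   # keeps apt from prompting for a timezone
    run: |
      apt-get update
      apt-get install -y software-properties-common
      add-apt-repository -y ppa:deadsnakes/ppa
      apt-get update
      apt-get install -y \
        "python${{ matrix.python-version }}" \
        "python${{ matrix.python-version }}-venv"
```

Setting the variable at the step level applies it to every `apt-get` call in the script, so no individual command has to be prefixed.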
Without the pull_request trigger, CI never fires on PRs — checks aren't
even planned (e.g. PR #9 shows zero checks). GPU jobs are gated to
push/merge_group events to avoid consuming self-hosted GPU runners on
every PR update.

Made-with: Cursor
GPU jobs complete in ~5-10 minutes and serve as a useful pre-merge check.

Made-with: Cursor
The copy-pr-bot creates pull-request/N branches for each PR, which
matched the push trigger and caused every CI job to run twice (once
from pull_request, once from push). The pull_request trigger already
covers PRs targeting main, so the push pattern is redundant.

Made-with: Cursor
NVIDIA self-hosted runners block pull_request events outright. GPU CI
must run via push events — either to main or to pull-request/[0-9]+
branches created by copy-pr-bot for PR testing.

- Restore "pull-request/[0-9]+" in push trigger
- Gate gpu-tests with if: github.event_name != 'pull_request'
- CPU jobs (inference-tests, unit-tests, etc.) still run on pull_request

Made-with: Cursor
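
The gating described in this commit could be sketched as below. The runner labels and the single placeholder step are hypothetical; the `if:` expression and the branch pattern are taken from the commit message.

```yaml
# Sketch: one workflow, with the GPU job skipped on pull_request events
on:
  pull_request:
    branches: [main]
  push:
    branches:
      - main
      - "pull-request/[0-9]+"
  merge_group:

jobs:
  gpu-tests:
    if: github.event_name != 'pull_request'   # self-hosted runners reject these events
    runs-on: [self-hosted, gpu]               # hypothetical runner labels
    steps:
      - run: nvidia-smi                       # placeholder for the real test steps
```

A job skipped by `if:` still shows up on the PR as a skipped check, which is the noise the later commits remove by moving GPU jobs into their own workflow.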
…jobs

Simplify triggers: all jobs (including GPU) run on pull_request, push
to main, and merge_group. The pull-request/[0-9]+ branch convention is
not used by contributors.

Made-with: Cursor
- Combine unit-tests (py3.12) and inference-tests (py3.11/3.12/3.13)
  into a single unit-tests matrix job across all three Python versions.
  Both ran identical test suites with inference requirements.
- Re-add if: github.event_name != 'pull_request' on gpu-tests since
  NVIDIA self-hosted runners block pull_request events entirely.
  GPU CI runs on push to main and merge_group.

Made-with: Cursor
NVIDIA self-hosted runners block pull_request events, so GPU jobs in
the main CI workflow always showed as a single "Skipped" entry with
unresolved matrix names on every PR.

Move GPU jobs to ci-gpu.yml (triggers: push to main, merge_group,
workflow_dispatch). The main ci.yml keeps CPU jobs only (triggers:
pull_request, push to main, merge_group, workflow_dispatch).

Made-with: Cursor
Add pull-request/[0-9]+ to ci-gpu.yml push trigger so GPU tests run
when copy-pr-bot creates the corresponding branch for a PR.

Made-with: Cursor
@ivanbasov ivanbasov force-pushed the refactor/consolidate-ci-test-jobs branch from f8d312e to 932a3e8 Compare March 5, 2026 22:37
The container default shell is sh, which doesn't have the source
builtin. Explicitly set shell: bash for the venv activation step.

Made-with: Cursor
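
A minimal sketch of the fix, assuming a virtual environment at `.venv` and a hypothetical smoke script name (neither appears in this PR):

```yaml
# Hypothetical smoke step; paths and script name are illustrative only
steps:
  - name: Smoke training and inference
    shell: bash                      # the container default is sh, which lacks `source`
    run: |
      source .venv/bin/activate      # hypothetical venv location
      python smoke_train.py          # hypothetical entry point
```

An alternative under the same assumptions would be to keep `sh` and use the POSIX `.` command instead of `source`, but setting `shell: bash` leaves the script itself unchanged.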
The smoke training step uses torch.compile which invokes the inductor
backend, requiring a C compiler. The ubuntu:22.04 container doesn't
ship with gcc.

Made-with: Cursor
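
Installing the compiler in the container could be done with a step like this sketch. The step name is an assumption, and the package list is limited to what the commit message mentions.

```yaml
# Hypothetical step in the GPU job, placed before the smoke training step
steps:
  - name: Install C compiler for torch.compile
    env:
      DEBIAN_FRONTEND: noninteractive
    run: |
      apt-get update
      apt-get install -y gcc
```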
Collaborator

@bmhowe23 bmhowe23 left a comment

This LGTM, but be advised that we can also use nv-cpu-general for CPU runner jobs. In some sense this is preferable, because it uses NVIDIA CPU runners.

Collaborator

@bmhowe23 bmhowe23 left a comment

Oh, I meant to approve, so resubmitting now.

Use nv-cpu-general runner group instead of GitHub-hosted ubuntu-latest.
Also restore pull-request/[0-9]+ push trigger in case self-hosted CPU
runners block pull_request events (same as GPU runners).

Made-with: Cursor
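
Targeting a runner group from a job might look like the sketch below. How the `nv-cpu-general` group and the `linux-amd64-cpu4` label (named in the squashed commit title) are actually combined in this repository is an assumption; the `group`/`labels` form of `runs-on` is standard GitHub Actions syntax.

```yaml
# Sketch: CPU job on a self-hosted runner group instead of ubuntu-latest
on:
  push:
    branches:
      - main
      - "pull-request/[0-9]+"   # restored in case these runners also block pull_request

jobs:
  unit-tests:
    runs-on:
      group: nv-cpu-general      # runner group named in the review comment
      labels: linux-amd64-cpu4   # label taken from the commit title; pairing is assumed
    steps:
      - uses: actions/checkout@v4
```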
NVIDIA self-hosted runners block pull_request events. All CI (CPU and
GPU) now runs via copy-pr-bot push to pull-request/[0-9]+ branches.

Made-with: Cursor
@ivanbasov ivanbasov merged commit 5c392b7 into NVIDIA:main Mar 6, 2026
8 checks passed
@ivanbasov ivanbasov deleted the refactor/consolidate-ci-test-jobs branch March 6, 2026 01:28
ivanbasov added a commit that referenced this pull request Apr 10, 2026
… matrix (#8)

* Consolidate CI test jobs: merge GPU smoke test and add Python version matrix

- Remove separate smoke-test-gpu job (was serial after gpu-tests, increasing
  pipeline time). Smoke training+inference now runs in the same gpu-tests job.
- Replace python-compat matrix (6 jobs, SKIP_TESTS=1) with two focused job
  groups that actually run tests:
  * gpu-tests: matrix over Python 3.11/3.12/3.13 on GPU runners — installs
    train deps, runs full test suite (CPU+GPU), then smoke training+inference.
  * inference-tests: matrix over Python 3.11/3.12/3.13 on CPU — installs
    inference deps, runs tests with pre-trained models (GPU tests auto-skip).

Reduces total jobs from 11 to 9 while increasing actual test coverage.

Made-with: Cursor

* Fix GPU CI: set DEBIAN_FRONTEND=noninteractive to prevent tzdata hang

The deadsnakes PPA pulls in tzdata as a dependency, which triggers an
interactive timezone configuration prompt in the container. This caused
all 3 GPU matrix jobs to hang for 45 minutes until timeout.

Made-with: Cursor

* Add pull_request trigger and gate GPU jobs to push/merge_group only

Without the pull_request trigger, CI never fires on PRs — checks aren't
even planned (e.g. PR #9 shows zero checks). GPU jobs are gated to
push/merge_group events to avoid consuming self-hosted GPU runners on
every PR update.

Made-with: Cursor

* Remove event gate on GPU jobs so they run on PRs too

GPU jobs complete in ~5-10 minutes and serve as a useful pre-merge check.

Made-with: Cursor

* Remove pull-request/[0-9]+ from push trigger to fix duplicate CI runs

The copy-pr-bot creates pull-request/N branches for each PR, which
matched the push trigger and caused every CI job to run twice (once
from pull_request, once from push). The pull_request trigger already
covers PRs targeting main, so the push pattern is redundant.

Made-with: Cursor

* Fix GPU CI: gate on event type, restore push trigger for copy-pr-bot

NVIDIA self-hosted runners block pull_request events outright. GPU CI
must run via push events — either to main or to pull-request/[0-9]+
branches created by copy-pr-bot for PR testing.

- Restore "pull-request/[0-9]+" in push trigger
- Gate gpu-tests with if: github.event_name != 'pull_request'
- CPU jobs (inference-tests, unit-tests, etc.) still run on pull_request

Made-with: Cursor

* Remove pull-request/[0-9]+ push pattern and pull_request gate on GPU jobs

Simplify triggers: all jobs (including GPU) run on pull_request, push
to main, and merge_group. The pull-request/[0-9]+ branch convention is
not used by contributors.

Made-with: Cursor

* Merge unit-tests + inference-tests, gate GPU jobs from pull_request

- Combine unit-tests (py3.12) and inference-tests (py3.11/3.12/3.13)
  into a single unit-tests matrix job across all three Python versions.
  Both ran identical test suites with inference requirements.
- Re-add if: github.event_name != 'pull_request' on gpu-tests since
  NVIDIA self-hosted runners block pull_request events entirely.
  GPU CI runs on push to main and merge_group.

Made-with: Cursor

* Split GPU tests into separate workflow to avoid skipped PR noise

NVIDIA self-hosted runners block pull_request events, so GPU jobs in
the main CI workflow always showed as a single "Skipped" entry with
unresolved matrix names on every PR.

Move GPU jobs to ci-gpu.yml (triggers: push to main, merge_group,
workflow_dispatch). The main ci.yml keeps CPU jobs only (triggers:
pull_request, push to main, merge_group, workflow_dispatch).

Made-with: Cursor

* Enable GPU CI on PRs via copy-pr-bot push trigger

Add pull-request/[0-9]+ to ci-gpu.yml push trigger so GPU tests run
when copy-pr-bot creates the corresponding branch for a PR.

Made-with: Cursor

* Fix smoke test step: use bash shell for source command

The container default shell is sh, which doesn't have the source
builtin. Explicitly set shell: bash for the venv activation step.

Made-with: Cursor

* Install gcc in GPU container for torch.compile/inductor

The smoke training step uses torch.compile which invokes the inductor
backend, requiring a C compiler. The ubuntu:22.04 container doesn't
ship with gcc.

Made-with: Cursor

* Switch CPU jobs to NVIDIA self-hosted linux-amd64-cpu4 runners

Use nv-cpu-general runner group instead of GitHub-hosted ubuntu-latest.
Also restore pull-request/[0-9]+ push trigger in case self-hosted CPU
runners block pull_request events (same as GPU runners).

Made-with: Cursor

* Remove pull_request trigger since all runners are NVIDIA self-hosted

NVIDIA self-hosted runners block pull_request events. All CI (CPU and
GPU) now runs via copy-pr-bot push to pull-request/[0-9]+ branches.

Made-with: Cursor