@adilhusain-s (Collaborator) commented Dec 24, 2025

Overview

This PR stabilizes the release pipeline by introducing partial manifest tooling, refactoring CI workflows to eliminate race conditions, and improving fault tolerance across architectures.

The key motivation is reliability.

Before this change, the pipeline tightly coupled builds, releases, and git updates inside matrix jobs. This made releases fragile and hard to recover from, and they became more error-prone once Trivy scanning was added. In particular:

  • Non-release artifacts (Trivy SBOMs and scan reports) were picked up during manifest parsing as if they were release artifacts.
  • A failure on a single architecture could cancel the entire workflow.
  • Concurrent matrix jobs attempted to push to the repository, causing race conditions and flaky failures.
  • Recovering from partial failures required rerunning the full workflow.

This PR decouples artifact generation from manifest updates, introduces an explicit aggregation step, and makes the pipeline resilient to partial failures.


How the Release Pipeline Works (After This PR)

At a high level, the pipeline now runs in four clearly separated phases:

  1. Discover which Python versions to release
  2. Build artifacts per architecture
  3. Create or update GitHub releases
  4. Aggregate partial manifests and update tracked data atomically

This separation is intentional and is what fixes the reliability issues.


Pipeline Flow Explained

1. Tag Discovery (get-tags job)

The workflow first determines which Python versions should be processed.

  • If .github/release/python-tag-filter.txt exists, it is used as a filter (e.g. 3.13.*).
  • Otherwise, the workflow derives a filter from the latest upstream Python version.
  • Matching Python tags are collected and passed as a JSON matrix.

This keeps the workflow deterministic and avoids manual inputs while still allowing controlled releases.
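
For reviewers who want the selection rule spelled out, here is a minimal Python sketch of the logic described above. It is illustrative only: the function names (derive_filter, select_tags, to_matrix_json) are hypothetical, and the assumption that the derived filter is "major.minor.*" of the latest upstream version is mine, based on the 3.13.* example.

```python
import fnmatch
import json


def derive_filter(latest_version: str) -> str:
    """Assumed rule: turn '3.13.1' into the glob '3.13.*'."""
    major, minor = latest_version.split(".")[:2]
    return f"{major}.{minor}.*"


def select_tags(all_tags, filter_text=None, latest_version=None):
    """Return tags matching the filter file contents, else a derived filter.

    filter_text stands in for the contents of
    .github/release/python-tag-filter.txt (None when the file is absent).
    """
    pattern = (filter_text or "").strip()
    if not pattern:
        if latest_version is None:
            raise ValueError("need a filter or the latest upstream version")
        pattern = derive_filter(latest_version)
    # Input order is preserved; no version-aware sorting is attempted here.
    return [tag for tag in all_tags if fnmatch.fnmatchcase(tag, pattern)]


def to_matrix_json(tags) -> str:
    """Serialize the selected tags as a JSON array for the job matrix."""
    return json.dumps(tags)
```

With tags 3.12.8, 3.13.0, 3.13.1 and 3.14.0, a filter of 3.13.* selects only the two 3.13 tags. With no filter file and a latest version of 3.14.0, only 3.14.0 is selected.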


2. Build & Package (Matrix Jobs)

For each discovered Python tag, the workflow runs a matrix build across:

  • Architectures (ppc64le, s390x)
  • Ubuntu versions (22.04, 24.04)

Key design choices:

  • fail-fast: false
    A failure on one architecture does not cancel other builds.
  • Each matrix job:
    • Builds the Python artifacts
    • Uploads them to GitHub Releases
    • Generates a partial manifest describing only its own artifacts

Partial manifests are uploaded as workflow artifacts and do not touch git.
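
As a rough illustration of what "describing only its own artifacts" means, the sketch below builds a partial manifest from a list of asset names and skips anything that is not a release tarball for the job's own version and architecture. It is not the code in generate_partial_manifest.py: the asset naming pattern, the manifest fields, and the build_partial_manifest name are all assumptions made for the example.

```python
import re

# Assumed naming scheme for release tarballs; the real pattern may differ.
ASSET_RE = re.compile(
    r"^python-(?P<version>\d+\.\d+\.\d+)"
    r"-linux-(?P<platform_version>\d+\.\d+)"
    r"-(?P<arch>ppc64le|s390x)\.tar\.gz$"
)


def build_partial_manifest(asset_names, version, arch, base_url):
    """Describe only the tarballs for one version/architecture pair."""
    files = []
    for name in sorted(asset_names):
        match = ASSET_RE.match(name)
        if match is None:
            continue  # not a release tarball (e.g. SBOM or scan report)
        if match["version"] != version or match["arch"] != arch:
            continue  # belongs to another matrix job
        files.append(
            {
                "filename": name,
                "arch": arch,
                "platform_version": match["platform_version"],
                "download_url": f"{base_url}/{name}",
            }
        )
    return {"version": version, "arch": arch, "files": files}
```

Matching against an allow-list pattern, rather than excluding known Trivy file names, means any new kind of non-release asset is ignored by default.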


3. Release Asset Finalization (release-assets job)

Once builds complete, a follow-up job ensures release assets are finalized per Python version.

  • Operates per Python tag (not per architecture)
  • Proceeds even if some architectures failed
  • Ensures release metadata is consistent

4. Manifest Aggregation (update-manifests job)

Instead of each build job pushing to the repository, a single aggregation job now runs:

  • Downloads all available partial manifest artifacts
    (missing artifacts are tolerated for failed architectures)
  • Merges partial manifests into the tracked data
  • Commits and pushes changes once, atomically

Concurrency is controlled so only one aggregation runs per ref.

If a build for one architecture fails, only that job needs to be rerun.
The regenerated partial manifest can then be recombined without restarting the full workflow.
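
The merge step can be pictured with the sketch below. It is a simplified stand-in for apply_partial_manifests.py, not a copy of it: the tracked-data shape (a dict of version to file entries), the one-JSON-file-per-partial layout, and the function names are assumptions. What it is meant to show is that missing partials are tolerated and that re-applying a regenerated partial replaces entries rather than duplicating them.

```python
import json
from pathlib import Path


def load_partials(directory):
    """Load every partial manifest found; a missing directory yields none."""
    root = Path(directory)
    if not root.is_dir():
        return []  # e.g. every build for this run failed
    return [json.loads(p.read_text()) for p in sorted(root.glob("*.json"))]


def merge_partials(tracked, partials):
    """Return a new mapping with partial entries merged in by filename."""
    merged = {version: list(files) for version, files in tracked.items()}
    for partial in partials:
        entries = merged.get(partial["version"], [])
        by_name = {entry["filename"]: entry for entry in entries}
        for entry in partial["files"]:
            by_name[entry["filename"]] = entry  # a rerun overwrites its old entry
        merged[partial["version"]] = sorted(
            by_name.values(), key=lambda entry: entry["filename"]
        )
    return merged
```

Because entries are keyed by filename, applying the same set of partials twice gives the same result as applying it once, which is what makes a single-architecture rerun safe to recombine.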


Key Changes

Infrastructure & Security

  • Added retry logic (8 attempts, 5s delay) to dotnet-install.py to handle transient network failures
  • Upgraded Trivy to v0.68.2 and made the build fail on High/Critical vulnerabilities
  • Simplified Makefile by removing unnecessary sudo usage
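
For context on the retry change, this is the general shape of a fixed-delay retry around a JSON fetch, using the 8 attempts and 5s delay mentioned above. It is a sketch rather than the code added to dotnet-install.py: the fetch_json_with_retry name, the injectable opener and sleep parameters, and the choice of which exceptions count as transient are assumptions.

```python
import json
import time
import urllib.request


def fetch_json_with_retry(
    url,
    attempts=8,
    delay=5.0,
    opener=urllib.request.urlopen,  # injectable so the loop can be tested offline
    sleep=time.sleep,
):
    """Fetch and parse JSON, retrying on network or parse errors."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return json.load(response)
        except (OSError, ValueError) as exc:
            # URLError is an OSError; JSONDecodeError is a ValueError.
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # fixed delay, no backoff
    raise RuntimeError(
        f"giving up on {url} after {attempts} attempts"
    ) from last_error
```

The final failure is raised with the last underlying error chained, so the CI log still shows the original cause.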

Partial Manifest Tooling

  • generate_partial_manifest.py: Generates architecture-scoped partial manifests
  • apply_partial_manifests.py: Merges partial manifests
  • backfill-manifests.yml: Regenerates or fixes manifests for existing releases without rebuilding binaries
  • Added unit tests for manifest generation and merging logic

Because each partial manifest describes only its job's own release artifacts, Trivy-generated assets (SBOMs and scan reports) no longer leak into release metadata.


CI/CD Workflow Refactor

  • Removed git push operations from matrix jobs
  • Introduced a single aggregation step
  • Added concurrency controls to serialize updates
  • Disabled fail-fast to preserve successful builds

The pipeline now follows an Artifact → Aggregate → Commit model.


Technical Rationale

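Pushing to main from within a matrix strategy caused race conditions: concurrent jobs started from the same commit, so a job that pushed after another had already updated main could be rejected as a non-fast-forward update, which showed up as flaky failures.
The new aggregation model eliminates these issues by pushing once from a single job, and it allows partial recovery without full reruns.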


Verification

  • ✅ Unit tests for partial manifest generation and merging
  • ✅ Infrastructure validated with upgraded Trivy
  • ✅ Backfill workflow verified to parse tags and generate partial artifacts correctly

adilhusain-s and others added 12 commits December 24, 2025 10:04
- dotnet-install.py: Add retry logic (8 attempts) for JSON fetching to handle network flakes.
- Makefile: Upgrade Trivy to v0.68.2 and enforce build failure on High/Critical vulnerabilities.

Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>

- Add 'generate_partial_manifest.py' and 'apply_partial_manifests.py' scripts.
- Add 'backfill-manifests.yml' workflow to process partial manifests.
- Add unit tests for manifest generation and application logic.

Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>

fix(tests): update error message assertion for invalid JSON handling

Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>

- release-matching-python-tags: Target Python 3.13.* and implement concurrency groups.
- reusable-release-python-tar: Remove direct Git push logic; generate partial manifest artifacts instead.
- release-matching-python-tags: Add 'update-manifests' job to aggregate partials and commit atomically.
- Optimize 'max-parallel' and disable 'fail-fast' for better resilience.

Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>

- Drop legacy manifest files for Python 3.9, 3.10, 3.11, and 3.12.
- Add and update manifest definitions for Python 3.13.x and 3.14.x on ppc64le and s390x architectures.

Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>
Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>
…gged URLs

Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>
Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>
Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>
…nd improve descriptions

Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>
Signed-off-by: Adilhusain Shaikh <Adilhusain.Shaikh@ibm.com>