
chore(ci): enforce digest-pinning and cover all production base images #520

Merged
JayVDZ merged 6 commits into main from fix/ci-base-image-scan-policy
Apr 10, 2026
JayVDZ commented Apr 10, 2026

Summary

Refactors scan-base-images in CI from a hand-maintained static matrix into a discover-then-scan pattern that (1) covers every production Dockerfile automatically, (2) enforces the digest-pinning policy as a CI gate rather than an aspirational convention, and (3) surfaces vulnerability findings to GitHub code scanning via SARIF rather than only to Actions logs.

Why

The previous job had three compounding gaps that together weakened JIM's supply chain compliance posture:

  1. src/JIM.Scheduler/Dockerfile was never scanned. The static matrix had three legs but only referenced two Dockerfiles (Web once, Worker twice for runtime+sdk). Worker and Scheduler happen to use the same base image digest today, but "they're identical" is a dangerous assumption for security-critical scanning: nothing prevents a future commit from drifting them apart with zero scan coverage on the Scheduler leg.

  2. Digest-pinning was documented policy, not enforced policy. engineering/DEVELOPER_GUIDE.md states that production Dockerfiles must pin base images by @sha256: digest, but no CI check enforced it. A future commit could silently remove a digest and the change would pass all existing CI.

  3. The matrix was hand-maintained. Adding a new production Dockerfile required a manual ci.yml edit. A forgotten edit would silently leave the new Dockerfile unscanned. This is exactly the class of drift that caused the Dependabot path breakage fixed in #504, "chore(ci): fix Dependabot Docker paths after src/ refactor" (eight weeks of silent Dependabot failure after Dockerfiles moved under src/).

What changes

New: .github/scripts/discover-base-images.ps1

PowerShell discovery script that walks the repository for Dockerfiles, identifies production images by the machine-readable directive # jim-compliance: production-image, parses every external FROM line, enforces digest-pinning, and emits a deduplicated matrix of unique image references for downstream scanning.

  • Handles multi-stage Dockerfiles by collecting FROM ... AS <alias> entries and skipping intra-file stage references
  • Skips FROM scratch
  • Traverses hidden directories (so .devcontainer/Dockerfile is found and correctly reported as unlabelled)
  • Fails with a clear error if any production Dockerfile contains a non-digest-pinned FROM
  • Fails if zero production Dockerfiles are discovered (regression guard against accidental label removal)
  • Runs locally for ad-hoc verification: pwsh -NoProfile -File .github/scripts/discover-base-images.ps1
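The parsing rules in the list above can be illustrated with a minimal Python sketch. It is not the PowerShell script from this PR: the function name, the regular expressions, and the choice to compare stage aliases case-insensitively are assumptions made for illustration only.

```python
import re

# Hypothetical names; the real logic lives in discover-base-images.ps1.
FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(?P<image>\S+)(?:\s+AS\s+(?P<alias>\S+))?\s*$",
    re.IGNORECASE,
)
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def external_base_images(dockerfile_text):
    """Return (pinned_images, violations) for one Dockerfile's FROM lines."""
    aliases = set()     # stage names declared so far via FROM ... AS <alias>
    images = []         # unique digest-pinned external images, in file order
    violations = []     # external images that lack an @sha256: digest
    for line in dockerfile_text.splitlines():  # splitlines() also handles CRLF
        match = FROM_RE.match(line)
        if not match:
            continue
        image = match.group("image")
        alias = match.group("alias")
        # Skip FROM scratch and references to an earlier stage in this file.
        if image.lower() != "scratch" and image.lower() not in aliases:
            if DIGEST_RE.search(image):
                if image not in images:
                    images.append(image)
            else:
                violations.append(image)
        if alias:
            aliases.add(alias.lower())
    return images, violations
```

A caller would fail the build when `violations` is non-empty for any labelled Dockerfile, and merge the `images` lists across files before deduplicating.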

Changed: .github/workflows/ci.yml

Replaces the single static scan-base-images job with a two-job pattern:

  • discover-base-images: runs the discovery script, emits the matrix as job output
  • scan-base-images: consumes needs.discover-base-images.outputs.matrix via fromJSON, scans each unique image with Trivy, emits SARIF (not table), and uploads to GitHub code scanning. security-events: write permission is scoped to this job only
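As a rough illustration of the hand-off between the two jobs, the Python sketch below builds a deduplicated matrix and appends it as a `matrix=<json>` line to the file named by `GITHUB_OUTPUT`, which is how a step publishes an output that a later job can read with fromJSON. The matrix shape (`include` entries with a single `image` key) and the function names are assumptions; the actual script is PowerShell and may emit additional fields.

```python
import json
import os


def build_matrix(images):
    """Deduplicate image references (order-preserving) into a matrix dict."""
    unique = list(dict.fromkeys(images))
    # Assumed shape: one matrix leg per unique image reference.
    return {"include": [{"image": image} for image in unique]}


def emit_matrix(images, output_path=None):
    """Append matrix=<json> to the step output file, if one is available."""
    line = "matrix=" + json.dumps(build_matrix(images), separators=(",", ":"))
    output_path = output_path or os.environ.get("GITHUB_OUTPUT")
    if output_path:
        with open(output_path, "a", encoding="utf-8") as handle:
            handle.write(line + "\n")
    return line
```

Keeping the JSON on a single line matters here, because the `name=value` output format is line-oriented.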

Trivy settings preserved from the existing job: severity: CRITICAL,HIGH, exit-code: 1, ignore-unfixed: true. The only scan behaviours that change are the output format (table -> SARIF) and the addition of the code scanning upload.

Changed: Three production Dockerfiles

src/JIM.Web/Dockerfile, src/JIM.Worker/Dockerfile, src/JIM.Scheduler/Dockerfile each carry a new directive block:

# syntax=docker/dockerfile:1
# jim-compliance: production-image
# This image ships to customers. Base image digest pinning is enforced by CI
# (see .github/workflows/ci.yml scan-base-images job). Do not remove the
# @sha256: digest from any FROM line or this Dockerfile will fail CI.

The directive is both machine-readable (discovery script greps for it) and self-documenting (a human reading the file immediately understands the policy scope without cross-referencing external docs). The approach is locality-of-reference correct: the policy lives with the artefact it governs.
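One way the "directive on its own line" check could look is sketched below in Python. The regular expression and the function name are illustrative assumptions, not the matching logic of the actual PowerShell script.

```python
import re

# The directive must be the only content of a comment line. A trailing \r is
# tolerated so that files with CRLF line endings are matched as well.
DIRECTIVE_RE = re.compile(
    r"^[ \t]*#[ \t]*jim-compliance:[ \t]*production-image[ \t]*\r?$",
    re.MULTILINE,
)


def is_production_dockerfile(text):
    """Return True when the compliance directive appears on its own line."""
    return DIRECTIVE_RE.search(text) is not None
```

Anchoring the match to a whole line means that prose which merely mentions the directive inside a longer comment does not label a file as production.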

.devcontainer/Dockerfile and test/integration/docker/**/Dockerfile are deliberately left unlabelled. They are developer and test infrastructure, not customer-shipped artefacts, and tracking upstream tags is the correct behaviour for them.

Changed: engineering/COMPLIANCE_MAPPING.md

Version bumped 1.0 -> 1.1. Three "Implemented" entries strengthened to reflect that digest-pinning is now machine-enforced:

  • NIST CSF GV.SC (Supply Chain Risk Management)
  • UK Software Security Code of Practice Principle 7 (Manage and secure third-party components)
  • NIST SP 800-53 SI-3 (Malicious Code Protection)

Two "Planned" entries added referencing #518 for the future pre-release integration test gate:

  • UK Software Security Code of Practice Principle 8 (Deploy securely)
  • NIST SP 800-53 SA-11 (Developer Testing and Evaluation)

Changed: engineering/DEVELOPER_GUIDE.md

The "Docker Base Images" subsection under "Dependency Pinning and Updates" is extended to document the compliance directive convention, how CI enforces it, and a "how to add a new production Dockerfile" recipe for future contributors.

Related tracking issues

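Created alongside this change to capture follow-up compliance work:

  • #517: Pin all GitHub Actions by commit SHA (v0.9-STABILISATION)
  • #518: Release gate for full integration test suite (v1.0-ILM-COMPLETE)
  • #519: Continuous SBOM generation on main (v1.0-ILM-COMPLETE)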

Test plan

  • YAML parses cleanly (python3 -c 'import yaml; yaml.safe_load(...)')
  • Positive test: discovery script runs against the real repo and correctly identifies 6 Dockerfiles (3 scanned, 3 skipped)
  • Negative test: discovery script against a synthetic Dockerfile with unpinned FROM fails with exit 1 and clear violation message
  • Deduplication verified: Worker and Scheduler both reference the same runtime digest; discovery emits only one matrix entry for it
  • Stage alias handling verified: FROM build AS publish and FROM base AS final correctly skipped
  • CRLF compatibility verified: script with CRLF line endings runs identically on Linux pwsh (the repo .gitattributes forces *.ps1 text eol=crlf)
  • First CI run on this PR validates that the dynamic fromJSON matrix works as expected. Post-merge verification: every ci.yml run on main since the merge has a successful discover-base-images -> scan-base-images dependency chain with the matrix legs dynamically expanded (confirmed on runs 24259797536, 24260333867, 24260337415, 24264679695, 24265336670).
  • First CI run validates that SARIF upload to code scanning works and that findings appear in the Security tab. Confirmed via gh api repos/TetronIO/JIM/code-scanning/analyses: 15 Trivy analyses have been uploaded post-merge, each tagged with a category matching the trivy-base-image-<dockerfile>-line<N> template from ci.yml, which shows the SARIF upload path working end-to-end.
  • Post-merge: verify no workflow is broken (release.yml audited to confirm it does not reference the old job structure). Confirmed: release.yml contains zero references to scan-base-images or the old static matrix structure, and all ci.yml runs on main continue to pass.

What to expect after merge

  1. New check names on PRs. The matrix leg names change from e.g. scan-base-images (aspnet, src/JIM.Web/Dockerfile) to the GitHub-derived leg name based on the dynamic matrix content. If main's branch protection requires any of the old names by string match, those requirements will need to be updated to the new names.

  2. GitHub Security tab populates for the first time. Any unfixed CRITICAL/HIGH CVEs in the current base images (there are several; see SBOM Observer screenshot from the conversation that spawned this work) will appear as informational findings in the Security tab. They will NOT fail the build (ignore-unfixed: true is preserved) but they will be visible and auditable.

  3. Adding a new production Dockerfile becomes zero-config. Drop a Dockerfile with # jim-compliance: production-image and digest-pinned FROM lines into any directory. Discovery picks it up automatically on the next CI run. No workflow edit required.

The previous scan-base-images job had three gaps that together weakened
JIM's supply chain compliance posture:

  1. Scheduler was never scanned. The matrix had three legs but only
     covered two Dockerfiles (Web once, Worker twice for runtime+sdk).
     Worker and Scheduler use the same base image digest today, but
     "they're identical" is a dangerous assumption for security-critical
     scanning: nothing prevents them from drifting apart in a future
     commit with no scan coverage on the Scheduler leg.

  2. Digest-pinning was policy, not enforcement. engineering/DEVELOPER_GUIDE.md
     states that production Dockerfiles must pin base images by @sha256:
     digest, but there was no CI check enforcing it. A future commit could
     silently remove a digest and the change would pass all existing CI.

  3. The matrix was hand-maintained. Adding a new production Dockerfile
     required a manual ci.yml edit; a forgotten edit would silently
     leave the new Dockerfile unscanned. This is exactly the class of
     drift that caused the Dependabot path bug fixed in #504.

Rewrite scan-base-images as a two-job discovery-then-scan pattern:

  - discover-base-images: a new PowerShell script at
    .github/scripts/discover-base-images.ps1 walks the repository for
    files named Dockerfile, identifies production images by the
    machine-readable directive "# jim-compliance: production-image" on
    their own line, parses every external FROM line, enforces
    digest-pinning, and emits a deduplicated matrix of unique image
    references for the downstream scan job. Zero production Dockerfiles
    or any non-digest-pinned FROM in a production Dockerfile fails the
    build with a clear message.

  - scan-base-images: now consumes the dynamic matrix via
    needs.discover-base-images.outputs.matrix. Trivy now emits SARIF
    instead of table format and uploads findings to GitHub code scanning
    (security-events: write permission scoped to this job only), so
    vulnerabilities are surfaced in the Security tab and auditable after
    the fact, not just visible to whoever happens to read the Actions
    log. The severity threshold (CRITICAL,HIGH), exit-code behaviour,
    and ignore-unfixed setting are all preserved from the existing job.

The three production Dockerfiles (src/JIM.Web, src/JIM.Worker,
src/JIM.Scheduler) are labelled with the compliance directive. The
.devcontainer/Dockerfile and integration test fixtures under
test/integration/docker/ are deliberately left unlabelled: they are
dev and test infrastructure, not customer-shipped artefacts, and
tracking upstream tags is the correct behaviour for them.

Adding a new production Dockerfile now requires only adding the
compliance directive to the file itself; discovery and scanning are
automatic. The approach is locality-of-reference correct: the policy
lives with the artefact, eliminating the class of drift that caused
the Dependabot path breakage.

engineering/COMPLIANCE_MAPPING.md is updated to reflect that:
  - Base image digest-pinning is now machine-enforced, strengthening
    alignment with NIST CSF GV.SC (Supply Chain Risk Management),
    UK Software Security Code of Practice Principle 7 (Manage and
    secure third-party components), and NIST SP 800-53 SI-3
    (Malicious Code Protection).
  - A "Planned" entry is added under Code of Practice Principle 8
    (Deploy securely) and NIST SP 800-53 SA-11 (Developer Testing and
    Evaluation) referencing #518, which tracks the future pre-release
    integration test gate.
  - Document version bumped 1.0 -> 1.1.

engineering/DEVELOPER_GUIDE.md is updated to document the compliance
directive convention and the CI enforcement so future contributors
know how to label new production Dockerfiles.

Related tracking issues created alongside this change:
  - #517: Pin all GitHub Actions by commit SHA (v0.9-STABILISATION)
  - #518: Release gate for full integration test suite (v1.0-ILM-COMPLETE)
  - #519: Continuous SBOM generation on main (v1.0-ILM-COMPLETE)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-advanced-security

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

JayVDZ and others added 4 commits April 10, 2026 12:12
…cs-sync

Two workflow hardening fixes surfaced by the first CI run on this PR:

1. Disable Trivy DB cache on scan-base-images. In the first CI run, all
   three Trivy scan legs exited 1 within ~12ms of logging "Detecting
   vulnerabilities", without writing any findings to the SARIF file.
   Running Trivy v0.69.3 locally against the exact same image digests
   with the exact same flags returned exit 0 with zero findings in
   every leg, confirming the base images are actually clean. The
   phantom exit 1 in CI tracks to a corrupted Trivy vuln DB cache
   restored from key "cache-trivy-2026-04-10". Setting cache: 'false'
   on the scan step sidesteps the cache and forces a fresh DB download
   per run, which adds ~5-10s per matrix leg but is acceptable for a
   security-critical scan.

2. Add explicit "permissions: contents: read" at the workflow level
   on metrics-sync.yml. This satisfies the CodeQL GitHub Actions
   analyser finding "Workflow does not contain permissions"
   (actions/missing-workflow-permissions, #102) which surfaced in the
   Security tab when SARIF upload started working. The workflow only
   needs to checkout + git diff (reads) and then dispatch to
   TetronIO/jim-metrics via a separate PAT stored in
   secrets.METRICS_REPO_DISPATCH_TOKEN, so the default GITHUB_TOKEN
   needs no writes. contents: read is the correct minimum.

Compliance alignment: both fixes strengthen UK Software Security Code
of Practice Principle 5 (Protect the build environment) and NIST CSF
GV.SC (Supply Chain Risk Management).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous CI run still had all three scan-base-images legs failing
with exit 1 despite disabling the Trivy DB cache. Further investigation
showed this was caused by image acquisition, not vulnerability detection:

- Running Trivy v0.69.3 locally with the exact CI env vars against the
  exact image digest via docker-in-docker always returned exit 0 with
  zero findings written to a valid SARIF file.
- The CI log for each scan leg showed Trivy reaching the "Detecting
  vulnerabilities" INFO line and then exiting 1 ~5.7 seconds later with
  no further output.
- Locally, Trivy's DEBUG output showed it was finding the image via
  source="docker" (the local Docker daemon), which had the image
  pre-pulled as part of this session's earlier troubleshooting.
- On GitHub-hosted runners, there is no pre-pulled image, and the
  trivy-action's image acquisition path appears to fail silently when
  combined with format=sarif output, producing exit 1 with no finding
  content.

Three fixes:

1. Add an explicit "docker pull" step before the Trivy scan. This
   guarantees Trivy finds the image via source="docker" and skips
   whatever acquisition path was silently failing.

2. Change the trivy-action exit-code from 1 to 0. The action's exit-code
   mechanism is the observed point of failure. Instead of trusting it,
   we evaluate findings ourselves in a follow-up step.

3. Add a PowerShell "Fail build on Trivy findings" step that parses
   trivy-results.sarif, counts runs[*].results[*], prints a clear
   "N findings" message, links to the Security tab, and exits 1 if any
   findings were reported. This is locally testable (verified against
   both a clean SARIF and a synthetic SARIF with one finding), produces
   better log output than the previous behaviour, and is robust against
   whatever trivy-action internal path was misbehaving.
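The evaluation step described in fix 3 can be sketched in Python as follows. The real step is PowerShell; the function names and the exact message text here are assumptions, and only the SARIF fields named above (`runs[*].results[*]`) are used.

```python
import json


def count_findings(sarif_text):
    """Count every entry under runs[*].results[*] in a SARIF document."""
    sarif = json.loads(sarif_text)
    return sum(len(run.get("results") or []) for run in sarif.get("runs") or [])


def gate(sarif_text):
    """Print a summary and return the exit code the CI step would use."""
    count = count_findings(sarif_text)
    print(f"{count} findings")
    return 1 if count else 0
```

A wrapper script would read trivy-results.sarif, pass its contents to `gate`, and exit with the returned code.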

The Upload Trivy scan results step is unchanged and still runs on
if: always(), so SARIF findings reach the Security tab regardless of
whether the fail-build step fires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nputs

The previous CI run produced 36 Trivy alerts in the Security tab (12 per
scan leg) despite the action being configured with severity: CRITICAL,HIGH.
Investigation of the alerts revealed:

- All 36 alerts are LOW severity openssl CVEs (CVE-2026-28387 through
  CVE-2026-31790). Trivy's own "Severity" field on each rule says LOW.
  Trivy's security_severity_level on each rule says "low". GitHub Code
  Scanning classifies them as "low" or "medium".

- The same Trivy version (0.69.3) running locally against the same image
  digests with the same environment variables returned zero findings -
  correctly filtering LOW-severity CVEs out at the source.

- The trivy-action wrapper uses a set_env_var_if_provided shell helper
  that writes TRIVY_SEVERITY=CRITICAL,HIGH to a temp file named
  trivy_envs.txt. Based on the CI log, this file is generated but the
  severity filter is not being honoured by Trivy at scan time, allowing
  LOW CVEs to leak through when format=sarif is in use.

Fix: remove the severity and ignore-unfixed inputs from the trivy-action
step entirely, and set TRIVY_SEVERITY and TRIVY_IGNORE_UNFIXED as
step-level environment variables instead. Trivy reads these directly
from its own environment, bypassing trivy-action's wrapper logic which
is the observed point of failure.

This preserves the existing structure (docker pull -> trivy scan ->
PowerShell SARIF evaluation -> code scanning upload) and only changes
how the filter flags reach Trivy. If step-level env vars still don't
take effect, the next escalation is running trivy as a direct shell
command via aquasecurity/setup-trivy, bypassing trivy-action entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Investigation of the previous CI run revealed why LOW-severity
findings kept being reported as if they were blocking HIGH findings:

- We queried code scanning for the 36 Trivy alerts on the PR ref and
  inspected each rule's properties.
- All 36 findings had CVSS security-severity scores in the 2.0-5.5
  range (low/medium), with tags ["LOW", "security", "vulnerability"].
- Despite TRIVY_SEVERITY=CRITICAL,HIGH being set both via the
  trivy-action input AND as a step-level environment variable,
  Trivy in CI did not filter these LOW-severity findings out of
  its SARIF output.
- Running the same Trivy version (0.69.3) locally with the exact
  same env vars correctly filtered them out, returning zero results.

Rather than continue diagnosing why Trivy's severity filter is
unreliable when running under trivy-action in CI, switch to a
two-stage approach:

  1. Trivy scans without a severity filter and writes everything
     it finds to the SARIF file. This is reliable.

  2. The PowerShell evaluation step reads each rule's CVSS score
     from rule.properties.'security-severity' (the same field
     GitHub Code Scanning uses to classify alerts), and counts only
     findings with CVSS >= 7.0 as blocking (HIGH or CRITICAL).
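The two-stage approach above can be sketched in Python as shown below. This is an illustration rather than the PowerShell step itself: the function names, the band boundaries below 7.0, and the decision to treat a missing security-severity value as 0.0 are assumptions. A stricter gate might prefer to fail when the score is missing.

```python
import json


def severity_band(score):
    """Map a CVSS score to a severity band (boundaries assumed for this sketch)."""
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    return "LOW"


def evaluate(sarif_text, threshold=7.0):
    """Return (counts per band, blocking findings) for a SARIF document."""
    counts = {"CRITICAL": 0, "HIGH": 0, "MEDIUM": 0, "LOW": 0}
    blocking = []
    for run in json.loads(sarif_text).get("runs") or []:
        rules = run.get("tool", {}).get("driver", {}).get("rules") or []
        # Look up each rule's CVSS score from properties.'security-severity'.
        scores = {
            rule["id"]: float(rule.get("properties", {}).get("security-severity", 0.0))
            for rule in rules
        }
        for result in run.get("results") or []:
            rule_id = result.get("ruleId")
            score = scores.get(rule_id, 0.0)  # assumption: unknown rules score 0.0
            counts[severity_band(score)] += 1
            if score >= threshold:
                blocking.append((rule_id, score))
    return counts, blocking
```

The step would print the counts as a breakdown and exit 1 only when `blocking` is non-empty.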

This is strictly better than relying on Trivy's filter:

- It uses CVSS as the source of truth, which is the industry
  standard for severity.
- It matches how GitHub Code Scanning classifies alerts in the
  Security tab, so our gate and the Security tab agree.
- It is fully testable locally (verified with both a real Trivy
  SARIF containing 17 results [14 LOW, 3 MEDIUM, 0 HIGH/CRITICAL]
  and a synthetic SARIF containing 1 LOW + 1 HIGH + 1 CRITICAL).
- It surfaces a clear severity breakdown in the CI log
  ("CRITICAL: 0, HIGH: 0, MEDIUM: 3, LOW: 14") and lists each
  blocking CVE by ID and CVSS when failing.
- It is robust against trivy-action wrapper bugs.

TRIVY_IGNORE_UNFIXED=true is preserved as a step env var so we
do not block on CVEs that have no upstream fix yet.

For the current state of the digest-pinned base images, this
parser correctly classifies every finding as LOW or MEDIUM, so
all three scan legs should now pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the scan-base-images CI gate fires on a fixable HIGH/CRITICAL CVE
that lives in a Microsoft-published base image layer (e.g., Ubuntu
package CVEs in dotnet/runtime:10.0-noble that have an upstream fix
but have not yet been absorbed into a refreshed Microsoft image),
the JIM project cannot apply the fix directly. The fix has to come
from a Microsoft rebuild, which happens on its own cadence.

This commit documents what to do in that situation:

- engineering/DEVELOPER_GUIDE.md gains a new subsection
  ("When the scan-base-images gate blocks on an upstream-only CVE")
  under the existing Docker Base Images section. It explains the four
  available response options in order of preference (wait for the
  Microsoft rebuild, in-Dockerfile apt-get upgrade, temporary gate
  threshold downgrade, or alert dismissal in the Security tab) and
  explicitly forbids continue-on-error: true as a permanent workaround.

- engineering/COMPLIANCE_MAPPING.md gains a new "Operational
  Considerations" section that briefly describes the situation,
  acknowledges it as a known limitation of digest-pinned base images,
  clarifies it is not a compliance gap (digest pinning, scanning, and
  SBOM generation all still operate correctly), and links readers to
  the developer guide for the response procedure.

There is no code change in this commit. The operational reality has
not changed; the documentation is being added now because the
investigation that produced PR #520 surfaced the question and we
want the answer captured before it is forgotten.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JayVDZ merged commit e0e3aa8 into main on Apr 10, 2026
14 checks passed
JayVDZ deleted the fix/ci-base-image-scan-policy branch on Apr 10, 2026, 14:48