chore(ci): enforce digest-pinning and cover all production base images#520
Merged
The previous scan-base-images job had three gaps that together weakened
JIM's supply chain compliance posture:
1. Scheduler was never scanned. The matrix had three legs but only
covered two Dockerfiles (Web once, Worker twice for runtime+sdk).
Worker and Scheduler use the same base image digest today, but
"they're identical" is a dangerous assumption for security-critical
scanning: nothing prevents them from drifting apart in a future
commit with no scan coverage on the Scheduler leg.
2. Digest-pinning was policy, not enforcement. engineering/DEVELOPER_GUIDE.md
states that production Dockerfiles must pin base images by @sha256:
digest, but there was no CI check enforcing it. A future commit could
silently remove a digest and the change would pass all existing CI.
3. The matrix was hand-maintained. Adding a new production Dockerfile
required a manual ci.yml edit; a forgotten edit would silently
leave the new Dockerfile unscanned. This is exactly the class of
drift that caused the Dependabot path bug fixed in #504.
Rewrite scan-base-images as a two-job discovery-then-scan pattern:
- discover-base-images: a new PowerShell script at
.github/scripts/discover-base-images.ps1 walks the repository for
files named Dockerfile, identifies production images by the
machine-readable directive "# jim-compliance: production-image" on
their own line, parses every external FROM line, enforces
digest-pinning, and emits a deduplicated matrix of unique image
references for the downstream scan job. The build fails with a clear
message if no production Dockerfiles are found or if any FROM in a
production Dockerfile is not digest-pinned.
- scan-base-images: now consumes the dynamic matrix via
needs.discover-base-images.outputs.matrix. Trivy now emits SARIF
instead of table format and uploads findings to GitHub code scanning
(security-events: write permission scoped to this job only), so
vulnerabilities are surfaced in the Security tab and auditable after
the fact, not just visible to whoever happens to read the Actions
log. The severity threshold (CRITICAL,HIGH), exit-code behaviour,
and ignore-unfixed setting are all preserved from the existing job.
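The discovery step described above can be sketched in Python. This is a minimal illustration, not the PowerShell script from this PR: the function names, the simplified FROM regex, and the `{"include": [...]}` matrix shape are all assumptions made for the example. Only the directive text and the rules (directive on its own line, skip stage aliases and `scratch`, require `@sha256:`, deduplicate) come from the description.

```python
import re
from pathlib import Path

# Directive text quoted from the PR description.
DIRECTIVE = "# jim-compliance: production-image"

# Simplified FROM parser (assumption): optional --platform flag, image, optional alias.
FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?\s*$",
    re.IGNORECASE,
)


def is_production(dockerfile_text):
    """True when the directive appears on its own line."""
    return any(line.strip() == DIRECTIVE for line in dockerfile_text.splitlines())


def external_images(dockerfile_text):
    """Return external base images, skipping intra-file stages and scratch."""
    stages = set()
    images = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image, alias = match.group(1), match.group(2)
        if image.lower() not in stages and image.lower() != "scratch":
            images.append(image)
        if alias:
            stages.add(alias.lower())
    return images


def discover(root):
    """Build a deduplicated scan matrix; raise ValueError on a policy violation."""
    unique = {}  # image reference -> first Dockerfile that uses it
    production_count = 0
    for path in sorted(Path(root).rglob("Dockerfile")):
        text = path.read_text(encoding="utf-8")
        if not is_production(text):
            continue
        production_count += 1
        for image in external_images(text):
            if "@sha256:" not in image:
                raise ValueError(f"{path}: FROM {image} is not digest-pinned")
            unique.setdefault(image, str(path))
    if production_count == 0:
        raise ValueError("no production Dockerfiles found")
    # Hypothetical matrix shape; the real script's output format is not shown here.
    return {"include": [{"image": i, "dockerfile": d} for i, d in unique.items()]}
```

In a workflow, the result of `discover(".")` would typically be serialised with `json.dumps` and written to the job output so that a downstream job can expand it with `fromJSON`.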
The three production Dockerfiles (src/JIM.Web, src/JIM.Worker,
src/JIM.Scheduler) are labelled with the compliance directive. The
.devcontainer/Dockerfile and integration test fixtures under
test/integration/docker/ are deliberately left unlabelled: they are
dev and test infrastructure, not customer-shipped artefacts, and
tracking upstream tags is the correct behaviour for them.
Adding a new production Dockerfile now requires only adding the
compliance directive to the file itself; discovery and scanning are
automatic. The approach is locality-of-reference correct: the policy
lives with the artefact, eliminating the class of drift that caused
the Dependabot path breakage.
engineering/COMPLIANCE_MAPPING.md is updated to reflect that:
- Base image digest-pinning is now machine-enforced, strengthening
alignment with NIST CSF GV.SC (Supply Chain Risk Management),
UK Software Security Code of Practice Principle 7 (Manage and
secure third-party components), and NIST SP 800-53 SI-3
(Malicious Code Protection).
- A "Planned" entry is added under Code of Practice Principle 8
(Deploy securely) and NIST SP 800-53 SA-11 (Developer Testing and
Evaluation) referencing #518, which tracks the future pre-release
integration test gate.
- Document version bumped 1.0 -> 1.1.
engineering/DEVELOPER_GUIDE.md is updated to document the compliance
directive convention and the CI enforcement so future contributors
know how to label new production Dockerfiles.
Related tracking issues created alongside this change:
- #517: Pin all GitHub Actions by commit SHA (v0.9-STABILISATION)
- #518: Release gate for full integration test suite (v1.0-ILM-COMPLETE)
- #519: Continuous SBOM generation on main (v1.0-ILM-COMPLETE)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cs-sync
Two workflow hardening fixes surfaced by the first CI run on this PR:
1. Disable Trivy DB cache on scan-base-images. In the first CI run, all
three Trivy scan legs exited 1 within ~12ms of logging "Detecting
vulnerabilities", without writing any findings to the SARIF file.
Running Trivy v0.69.3 locally against the exact same image digests with
the exact same flags returned exit 0 with zero findings in every leg,
confirming the base images are actually clean. The phantom exit 1 in CI
tracks to a corrupted Trivy vuln DB cache restored from key
"cache-trivy-2026-04-10". Setting cache: 'false' on the scan step
sidesteps the cache and forces a fresh DB download per run, which adds
~5-10s per matrix leg but is acceptable for a security-critical scan.
2. Add explicit "permissions: contents: read" at the workflow level on
metrics-sync.yml. This satisfies the CodeQL GitHub Actions analyser
finding "Workflow does not contain permissions"
(actions/missing-workflow-permissions, #102) which surfaced in the
Security tab when SARIF upload started working. The workflow only needs
to checkout + git diff (reads) and then dispatch to TetronIO/jim-metrics
via a separate PAT stored in secrets.METRICS_REPO_DISPATCH_TOKEN, so the
default GITHUB_TOKEN needs no writes. contents: read is the correct
minimum.
Compliance alignment: both fixes strengthen UK Software Security Code of
Practice Principle 5 (Protect the build environment) and NIST CSF GV.SC
(Supply Chain Risk Management).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous CI run still had all three scan-base-images legs failing
with exit 1 despite disabling the Trivy DB cache. Further investigation
showed this was caused by image acquisition, not vulnerability
detection:
- Running Trivy v0.69.3 locally with the exact CI env vars against the
exact image digest via docker-in-docker always returned exit 0 with
zero findings written to a valid SARIF file.
- The CI log for each scan leg showed Trivy reaching the "Detecting
vulnerabilities" INFO line and then exiting 1 ~5.7 seconds later with
no further output.
- Locally, Trivy's DEBUG output showed it was finding the image via
source="docker" (the local Docker daemon), which had the image
pre-pulled as part of this session's earlier troubleshooting.
- On GitHub-hosted runners, there is no pre-pulled image, and the
trivy-action's image acquisition path appears to fail silently when
combined with format=sarif output, producing exit 1 with no finding
content.
Three fixes:
1. Add an explicit "docker pull" step before the Trivy scan. This
guarantees Trivy finds the image via source="docker" and skips whatever
acquisition path was silently failing.
2. Set trivy-action exit-code from 1 to 0. The action's exit-code
mechanism is the observed point of failure. Instead of trusting it, we
evaluate findings ourselves in a follow-up step.
3. Add a PowerShell "Fail build on Trivy findings" step that parses
trivy-results.sarif, counts runs[*].results[*], prints a clear
"N findings" message, links to the Security tab, and exits 1 if any
findings were reported. This is locally testable (verified against both
a clean SARIF and a synthetic SARIF with one finding), produces better
log output than the previous behaviour, and is robust against whatever
trivy-action internal path was misbehaving.
The Upload Trivy scan results step is unchanged and still runs on
if: always(), so SARIF findings reach the Security tab regardless of
whether the fail-build step fires.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
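The "Fail build on Trivy findings" step in the commit above counts runs[*].results[*] in the SARIF file. A rough Python equivalent of that counting logic is sketched below. The real step is PowerShell and is not shown in this PR text, so the function names and the exact message wording here are illustrative assumptions.

```python
import json


def count_findings(sarif_text):
    """Count every entry under runs[*].results[*] in a SARIF document."""
    sarif = json.loads(sarif_text)
    return sum(len(run.get("results") or []) for run in sarif.get("runs") or [])


def gate_exit_code(sarif_text):
    """Return 1 when any finding is present, otherwise 0 (hypothetical helper)."""
    findings = count_findings(sarif_text)
    print(f"{findings} findings")
    return 1 if findings else 0
```

A caller would read trivy-results.sarif, pass its contents to `gate_exit_code`, and hand the result to `sys.exit`, which keeps the pass/fail decision outside the action's own exit-code handling.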
…nputs
The previous CI run produced 36 Trivy alerts in the Security tab (12 per
scan leg) despite the action being configured with severity:
CRITICAL,HIGH. Investigation of the alerts revealed:
- All 36 alerts are LOW severity openssl CVEs (CVE-2026-28387 through
CVE-2026-31790). Trivy's own "Severity" field on each rule says LOW.
Trivy's security_severity_level on each rule says "low". GitHub Code
Scanning classifies them as "low" or "medium".
- The same Trivy version (0.69.3) running locally against the same image
digests with the same environment variables returned zero findings,
correctly filtering LOW-severity CVEs out at the source.
- The trivy-action wrapper uses a set_env_var_if_provided shell helper
that writes TRIVY_SEVERITY=CRITICAL,HIGH to a temp file named
trivy_envs.txt. Based on the CI log, this file is generated but the
severity filter is not being honoured by Trivy at scan time, allowing
LOW CVEs to leak through when format=sarif is in use.
Fix: remove the severity and ignore-unfixed inputs from the trivy-action
step entirely, and set TRIVY_SEVERITY and TRIVY_IGNORE_UNFIXED as
step-level environment variables instead. Trivy reads these directly
from its own environment, bypassing trivy-action's wrapper logic which
is the observed point of failure.
This preserves the existing structure (docker pull -> trivy scan ->
PowerShell SARIF evaluation -> code scanning upload) and only changes
how the filter flags reach Trivy. If step-level env vars still don't
take effect, the next escalation is running trivy as a direct shell
command via aquasecurity/setup-trivy, bypassing trivy-action entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Investigation of the previous CI run revealed why we kept getting
phantom HIGH findings:
- We queried code scanning for the 36 Trivy alerts on the PR ref and
inspected each rule's properties.
- All 36 findings had CVSS security-severity scores in the 2.0-5.5
range (low/medium), with tags ["LOW", "security", "vulnerability"].
- Despite TRIVY_SEVERITY=CRITICAL,HIGH being set both via the
trivy-action input AND as a step-level environment variable,
Trivy in CI did not filter these LOW-severity findings out of
its SARIF output.
- Running the same Trivy version (0.69.3) locally with the exact
same env vars correctly filtered them out, returning zero results.
Rather than continue diagnosing why Trivy's severity filter is
unreliable when running under trivy-action in CI, switch to a
two-stage approach:
1. Trivy scans without a severity filter and writes everything
it finds to the SARIF file. This is reliable.
2. The PowerShell evaluation step reads each rule's CVSS score
from rule.properties.'security-severity' (the same field
GitHub Code Scanning uses to classify alerts), and counts only
findings with CVSS >= 7.0 as blocking (HIGH or CRITICAL).
This is strictly better than relying on Trivy's filter:
- It uses CVSS as the source of truth, which is the industry
standard for severity.
- It matches how GitHub Code Scanning classifies alerts in the
Security tab, so our gate and the Security tab agree.
- It is fully testable locally (verified with both a real Trivy
SARIF containing 17 results [14 LOW, 3 MEDIUM, 0 HIGH/CRITICAL]
and a synthetic SARIF containing 1 LOW + 1 HIGH + 1 CRITICAL).
- It surfaces a clear severity breakdown in the CI log
("CRITICAL: 0, HIGH: 0, MEDIUM: 3, LOW: 14") and lists each
blocking CVE by ID and CVSS when failing.
- It is robust against trivy-action wrapper bugs.
TRIVY_IGNORE_UNFIXED=true is preserved as a step env var so we
do not block on CVEs that have no upstream fix yet.
For the current state of the digest-pinned base images, this
parser correctly classifies every finding as LOW or MEDIUM, so
all three scan legs should now pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
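The two-stage approach in the commit above moves the severity decision into the evaluation step. The sketch below shows one way that classification could look in Python. It is not the PowerShell step from this PR. The 7.0 blocking threshold and the 'security-severity' rule property come from the commit message. The 9.0 and 4.0 band boundaries follow the usual CVSS v3 ranges, and treating a missing score as 0.0 is an assumption made for the example.

```python
BLOCKING_THRESHOLD = 7.0  # HIGH or CRITICAL, per the commit message


def band(score):
    """Map a CVSS score to a severity band (standard CVSS v3 ranges)."""
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    return "LOW"


def evaluate(sarif):
    """Return (counts per band, list of blocking (rule id, score) pairs)."""
    counts = {"CRITICAL": 0, "HIGH": 0, "MEDIUM": 0, "LOW": 0}
    blocking = []
    for run in sarif.get("runs", []):
        rules = run.get("tool", {}).get("driver", {}).get("rules", [])
        # Assumption: a rule without a security-severity property scores 0.0.
        scores = {
            rule["id"]: float(rule.get("properties", {}).get("security-severity", 0.0))
            for rule in rules
        }
        for result in run.get("results", []):
            rule_id = result.get("ruleId")
            score = scores.get(rule_id, 0.0)
            counts[band(score)] += 1
            if score >= BLOCKING_THRESHOLD:
                blocking.append((rule_id, score))
    return counts, blocking
```

A gate built on this would print the counts as the severity breakdown, list each blocking pair, and exit non-zero only when `blocking` is non-empty. A stricter variant could treat a missing score as blocking instead of LOW.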
This was referenced Apr 10, 2026
When the scan-base-images CI gate fires on a fixable HIGH/CRITICAL CVE
that lives in a Microsoft-published base image layer (e.g., Ubuntu
package CVEs in dotnet/runtime:10.0-noble that have an upstream fix
but have not yet been absorbed into a refreshed Microsoft image),
the JIM project cannot apply the fix directly. The fix has to come
from a Microsoft rebuild, which happens on its own cadence.
This commit documents what to do in that situation:
- engineering/DEVELOPER_GUIDE.md gains a new subsection
("When the scan-base-images gate blocks on an upstream-only CVE")
under the existing Docker Base Images section. It explains the four
available response options in order of preference (wait for the
Microsoft rebuild, in-Dockerfile apt-get upgrade, temporary gate
threshold downgrade, or alert dismissal in the Security tab) and
explicitly forbids continue-on-error: true as a permanent workaround.
- engineering/COMPLIANCE_MAPPING.md gains a new "Operational
Considerations" section that briefly describes the situation,
acknowledges it as a known limitation of digest-pinned base images,
clarifies it is not a compliance gap (digest pinning, scanning, and
SBOM generation all still operate correctly), and links readers to
the developer guide for the response procedure.
There is no code change in this commit. The operational reality has
not changed; the documentation is being added now because the
investigation that produced PR #520 surfaced the question and we
want the answer captured before it is forgotten.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Apr 10, 2026
Closed
Summary
Refactors scan-base-images in CI from a hand-maintained static matrix into a discover-then-scan pattern that (1) covers every production Dockerfile automatically, (2) enforces the digest-pinning policy as a CI gate rather than an aspirational convention, and (3) surfaces vulnerability findings to GitHub code scanning via SARIF rather than only to Actions logs.
Why
The previous job had three compounding gaps that together weakened JIM's supply chain compliance posture:
1. src/JIM.Scheduler/Dockerfile was never scanned. The static matrix had three legs but only referenced two Dockerfiles (Web once, Worker twice for runtime+sdk). Worker and Scheduler happen to use the same base image digest today, but "they're identical" is a dangerous assumption for security-critical scanning: nothing prevents a future commit from drifting them apart with zero scan coverage on the Scheduler leg.
2. Digest-pinning was documented policy, not enforced policy. engineering/DEVELOPER_GUIDE.md states that production Dockerfiles must pin base images by @sha256: digest, but no CI check enforced it. A future commit could silently remove a digest and the change would pass all existing CI.
3. The matrix was hand-maintained. Adding a new production Dockerfile required a manual ci.yml edit. A forgotten edit would silently leave the new Dockerfile unscanned. This is exactly the class of drift that caused the Dependabot path breakage fixed in #504, "chore(ci): fix Dependabot Docker paths after src/ refactor" (eight weeks of silent Dependabot failure after Dockerfiles moved under src/).
What changes
New:
.github/scripts/discover-base-images.ps1
PowerShell discovery script that walks the repository for Dockerfiles, identifies production images by the machine-readable directive # jim-compliance: production-image, parses every external FROM line, enforces digest-pinning, and emits a deduplicated matrix of unique image references for downstream scanning.
- … FROM ... AS <alias> entries and skipping intra-file stage references
- … FROM scratch
- … .devcontainer/Dockerfile is found and correctly reported as unlabelled)
- … FROM
- … pwsh -NoProfile -File .github/scripts/discover-base-images.ps1
Changed:
.github/workflows/ci.yml
Replaces the single static scan-base-images job with a two-job pattern:
- discover-base-images: runs the discovery script, emits the matrix as job output
- scan-base-images: consumes needs.discover-base-images.outputs.matrix via fromJSON, scans each unique image with Trivy, emits SARIF (not table), and uploads to GitHub code scanning. security-events: write permission is scoped to this job only
Trivy settings preserved from the existing job: severity: CRITICAL,HIGH, exit-code: 1, ignore-unfixed: true. The only scan behaviour that changes is the output format (table -> SARIF) and the addition of the code scanning upload.
Changed: Three production Dockerfiles
src/JIM.Web/Dockerfile, src/JIM.Worker/Dockerfile, src/JIM.Scheduler/Dockerfile each carry a new directive block:
The directive is both machine-readable (discovery script greps for it) and self-documenting (a human reading the file immediately understands the policy scope without cross-referencing external docs). The approach is locality-of-reference correct: the policy lives with the artefact it governs.
.devcontainer/Dockerfile and test/integration/docker/**/Dockerfile are deliberately left unlabelled. They are developer and test infrastructure, not customer-shipped artefacts, and tracking upstream tags is the correct behaviour for them.
Changed:
engineering/COMPLIANCE_MAPPING.md
Version bumped 1.0 -> 1.1. Three "Implemented" entries strengthened to reflect that digest-pinning is now machine-enforced:
Two "Planned" entries added referencing #518 for the future pre-release integration test gate:
Changed:
engineering/DEVELOPER_GUIDE.md
The "Docker Base Images" subsection under "Dependency Pinning and Updates" is extended to document the compliance directive convention, how CI enforces it, and a "how to add a new production Dockerfile" recipe for future contributors.
Related tracking issues
Created alongside this change to capture follow-up compliance work:
Test plan
- … python3 -c 'import yaml; yaml.safe_load(...)')
- … runtime digest; discovery emits only one matrix entry for it
- … FROM build AS publish and FROM base AS final correctly skipped
- … pwsh (the repo .gitattributes forces *.ps1 text eol=crlf)
- … fromJSON matrix works as expected. Post-merge verification: every ci.yml run on main since the merge has a successful discover-base-images -> scan-base-images dependency chain with the matrix legs dynamically expanded (confirmed on runs 24259797536, 24260333867, 24260337415, 24264679695, 24265336670).
- … gh api repos/TetronIO/JIM/code-scanning/analyses: 15 Trivy analyses have been uploaded post-merge, each tagged with a category matching the trivy-base-image-<dockerfile>-line<N> template from ci.yml, proving the SARIF upload path is working end-to-end.
- … release.yml audited to confirm it does not reference the old job structure). Confirmed: release.yml contains zero references to scan-base-images or the old static matrix structure, and all ci.yml runs on main continue to pass.
What to expect after merge
New check names on PRs. The matrix leg names change from e.g. scan-base-images (aspnet, src/JIM.Web/Dockerfile) to the GitHub-derived leg name based on the dynamic matrix content. If main's branch protection requires any of the old names by string match, those requirements will need to be updated to the new names.
GitHub Security tab populates for the first time. Any unfixed CRITICAL/HIGH CVEs in the current base images (there are several; see SBOM Observer screenshot from the conversation that spawned this work) will appear as informational findings in the Security tab. They will NOT fail the build (ignore-unfixed: true is preserved) but they will be visible and auditable.
Adding a new production Dockerfile becomes zero-config. Drop a Dockerfile with # jim-compliance: production-image and digest-pinned FROM lines into any directory. Discovery picks it up automatically on the next CI run. No workflow edit required.