chore(ci): fix Dependabot Docker paths after src/ refactor #504
Merged
Conversation
The three Docker base image blocks pointed at /JIM.Web, /JIM.Worker, and /JIM.Scheduler, but those paths ceased to exist when commit 70b3555 ("refactor: Move source projects into src/ directory to declutter repo root") relocated the Dockerfiles under src/. Dependabot has therefore been silently failing to scan any of the three Dockerfiles since mid-February, leaving base image digests stale and allowing OS-level CVEs to accumulate in release SBOMs.

Consolidate the three near-identical blocks into a single entry using the multi-directory `directories` key (GitHub, June 2024), pointing at the correct src/ paths. Also include /.devcontainer, which uses a Dockerfile but was never tracked by Dependabot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JayVDZ added a commit that referenced this pull request on Apr 10, 2026
The previous scan-base-images job had three gaps that together weakened
JIM's supply chain compliance posture:
1. Scheduler was never scanned. The matrix had three legs but only
covered two Dockerfiles (Web once, Worker twice for runtime+sdk).
Worker and Scheduler use the same base image digest today, but
"they're identical" is a dangerous assumption for security-critical
scanning: nothing prevents them from drifting apart in a future
commit with no scan coverage on the Scheduler leg.
2. Digest-pinning was policy, not enforcement. engineering/DEVELOPER_GUIDE.md
states that production Dockerfiles must pin base images by @sha256:
digest, but there was no CI check enforcing it. A future commit could
silently remove a digest and the change would pass all existing CI.
3. The matrix was hand-maintained. Adding a new production Dockerfile
required a manual ci.yml edit; a forgotten edit would silently
leave the new Dockerfile unscanned. This is exactly the class of
drift that caused the Dependabot path bug fixed in #504.
Rewrite scan-base-images as a two-job discovery-then-scan pattern:
- discover-base-images: a new PowerShell script at
.github/scripts/discover-base-images.ps1 walks the repository for
files named Dockerfile, identifies production images by the
machine-readable directive "# jim-compliance: production-image" on
their own line, parses every external FROM line, enforces
digest-pinning, and emits a deduplicated matrix of unique image
references for the downstream scan job. Discovering zero production
Dockerfiles, or any non-digest-pinned FROM in a production Dockerfile,
fails the build with a clear message.
- scan-base-images: now consumes the dynamic matrix via
needs.discover-base-images.outputs.matrix. Trivy now emits SARIF
instead of table format and uploads findings to GitHub code scanning
(security-events: write permission scoped to this job only), so
vulnerabilities are surfaced in the Security tab and auditable after
the fact, not just visible to whoever happens to read the Actions
log. The severity threshold (CRITICAL,HIGH), exit-code behaviour,
and ignore-unfixed setting are all preserved from the existing job.
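The discovery step's validation logic can be sketched as follows. The repository's actual implementation is the PowerShell script named above; this Python port is purely illustrative, and the function name, regexes, and FROM-parsing details (e.g. no handling of `--platform` flags) are assumptions, not the script's real behaviour. The matrix shape assumes the standard GitHub Actions `include` form.

```python
import json
import re
from pathlib import Path

DIRECTIVE = "# jim-compliance: production-image"
FROM_RE = re.compile(r"^FROM\s+(\S+)", re.IGNORECASE)
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def discover_base_images(repo_root: str) -> str:
    """Walk repo_root for files named Dockerfile, keep those carrying the
    production-image directive, enforce digest pinning on every external
    FROM, and emit a deduplicated GitHub Actions matrix as JSON."""
    images: set[str] = set()
    production_count = 0
    for dockerfile in sorted(Path(repo_root).rglob("Dockerfile")):
        lines = [line.strip() for line in dockerfile.read_text().splitlines()]
        if DIRECTIVE not in lines:
            continue  # dev/test Dockerfile: deliberately untracked
        production_count += 1
        stage_names: set[str] = set()
        for line in lines:
            m = FROM_RE.match(line)
            if not m:
                continue
            ref = m.group(1)
            parts = line.split()
            if len(parts) >= 4 and parts[2].upper() == "AS":
                stage_names.add(parts[3])  # remember multi-stage aliases
            if ref in stage_names or ref == "scratch":
                continue  # internal build stage, not an external image
            if not DIGEST_RE.search(ref):
                raise SystemExit(f"{dockerfile}: FROM is not digest-pinned: {ref}")
            images.add(ref)
    if production_count == 0:
        raise SystemExit("no production Dockerfiles discovered")
    return json.dumps({"include": [{"image": i} for i in sorted(images)]})
```

Both failure modes named above (zero production Dockerfiles, unpinned FROM) abort with a message rather than emitting an empty or partial matrix.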
The three production Dockerfiles (src/JIM.Web, src/JIM.Worker,
src/JIM.Scheduler) are labelled with the compliance directive. The
.devcontainer/Dockerfile and integration test fixtures under
test/integration/docker/ are deliberately left unlabelled: they are
dev and test infrastructure, not customer-shipped artefacts, and
tracking upstream tags is the correct behaviour for them.
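A labelled production Dockerfile then carries the directive alongside its pinned base image. The image name below is illustrative and the digest is a placeholder, not a value from the repository:

```dockerfile
# jim-compliance: production-image
# FROM must be pinned by digest; the digest below is a placeholder.
FROM mcr.microsoft.com/dotnet/aspnet:10.0-noble@sha256:<digest-of-approved-base-image>
```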
Adding a new production Dockerfile now requires only adding the
compliance directive to the file itself; discovery and scanning are
automatic. The approach is locality-of-reference correct: the policy
lives with the artefact, eliminating the class of drift that caused
the Dependabot path breakage.
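The discovery-then-scan wiring might look like the sketch below. The job names, script path, and matrix output come from the description above; the step layout, `pwsh` invocation, and action versions are assumptions, not the workflow's exact contents:

```yaml
jobs:
  discover-base-images:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.discover.outputs.matrix }}
    steps:
      - uses: actions/checkout@v4
      - id: discover
        shell: pwsh
        run: ./.github/scripts/discover-base-images.ps1

  scan-base-images:
    needs: discover-base-images
    runs-on: ubuntu-latest
    permissions:
      security-events: write   # scoped to this job only, for SARIF upload
    strategy:
      matrix: ${{ fromJSON(needs.discover-base-images.outputs.matrix) }}
    steps:
      - name: Scan ${{ matrix.image }}
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ matrix.image }}
          format: sarif
          output: trivy-results.sarif
      - name: Upload Trivy scan results
        if: always()
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy-results.sarif
```

Because the matrix is computed at run time, a newly labelled Dockerfile flows into the scan job with no edit to ci.yml.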
engineering/COMPLIANCE_MAPPING.md is updated to reflect that:
- Base image digest-pinning is now machine-enforced, strengthening
alignment with NIST CSF GV.SC (Supply Chain Risk Management),
UK Software Security Code of Practice Principle 7 (Manage and
secure third-party components), and NIST SP 800-53 SI-3
(Malicious Code Protection).
- A "Planned" entry is added under Code of Practice Principle 8
(Deploy securely) and NIST SP 800-53 SA-11 (Developer Testing and
Evaluation) referencing #518, which tracks the future pre-release
integration test gate.
- Document version bumped 1.0 -> 1.1.
engineering/DEVELOPER_GUIDE.md is updated to document the compliance
directive convention and the CI enforcement so future contributors
know how to label new production Dockerfiles.
Related tracking issues created alongside this change:
- #517: Pin all GitHub Actions by commit SHA (v0.9-STABILISATION)
- #518: Release gate for full integration test suite (v1.0-ILM-COMPLETE)
- #519: Continuous SBOM generation on main (v1.0-ILM-COMPLETE)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JayVDZ added a commit that referenced this pull request on Apr 10, 2026
chore(ci): enforce digest-pinning and cover all production base images (#520)

* chore(ci): enforce digest-pinning and cover all production base images

(Commit message identical to the commit quoted in full above.)
* chore(ci): disable Trivy cache and pin least-privilege perms on metrics-sync

Two workflow hardening fixes surfaced by the first CI run on this PR:

1. Disable Trivy DB cache on scan-base-images. In the first CI run, all three Trivy scan legs exited 1 within ~12ms of logging "Detecting vulnerabilities", without writing any findings to the SARIF file. Running Trivy v0.69.3 locally against the exact same image digests with the exact same flags returned exit 0 with zero findings in every leg, confirming the base images are actually clean. The phantom exit 1 in CI tracks to a corrupted Trivy vuln DB cache restored from key "cache-trivy-2026-04-10". Setting cache: 'false' on the scan step sidesteps the cache and forces a fresh DB download per run, which adds ~5-10s per matrix leg but is acceptable for a security-critical scan.

2. Add explicit "permissions: contents: read" at the workflow level on metrics-sync.yml. This satisfies the CodeQL GitHub Actions analyser finding "Workflow does not contain permissions" (actions/missing-workflow-permissions, #102), which surfaced in the Security tab when SARIF upload started working. The workflow only needs to checkout + git diff (reads) and then dispatch to TetronIO/jim-metrics via a separate PAT stored in secrets.METRICS_REPO_DISPATCH_TOKEN, so the default GITHUB_TOKEN needs no writes. contents: read is the correct minimum.

Compliance alignment: both fixes strengthen UK Software Security Code of Practice Principle 5 (Protect the build environment) and NIST CSF GV.SC (Supply Chain Risk Management).
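The second fix is a single block at the top of metrics-sync.yml:

```yaml
# metrics-sync.yml — workflow-level default for all jobs
permissions:
  contents: read   # checkout + git diff only; repo dispatch uses a separate PAT
```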
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(ci): pre-pull base image and evaluate SARIF findings explicitly

The previous CI run still had all three scan-base-images legs failing with exit 1 despite disabling the Trivy DB cache. Further investigation showed this was caused by image acquisition, not vulnerability detection:

- Running Trivy v0.69.3 locally with the exact CI env vars against the exact image digest via docker-in-docker always returned exit 0 with zero findings written to a valid SARIF file.
- The CI log for each scan leg showed Trivy reaching the "Detecting vulnerabilities" INFO line and then exiting 1 ~5.7 seconds later with no further output.
- Locally, Trivy's DEBUG output showed it was finding the image via source="docker" (the local Docker daemon), which had the image pre-pulled as part of this session's earlier troubleshooting.
- On GitHub-hosted runners, there is no pre-pulled image, and the trivy-action's image acquisition path appears to fail silently when combined with format=sarif output, producing exit 1 with no finding content.

Three fixes:

1. Add an explicit "docker pull" step before the Trivy scan. This guarantees Trivy finds the image via source="docker" and skips whatever acquisition path was silently failing.
2. Set trivy-action exit-code from 1 to 0. The action's exit-code mechanism is the observed point of failure. Instead of trusting it, we evaluate findings ourselves in a follow-up step.
3. Add a PowerShell "Fail build on Trivy findings" step that parses trivy-results.sarif, counts runs[*].results[*], prints a clear "N findings" message, links to the Security tab, and exits 1 if any findings were reported. This is locally testable (verified against both a clean SARIF and a synthetic SARIF with one finding), produces better log output than the previous behaviour, and is robust against whatever trivy-action internal path was misbehaving.

The Upload Trivy scan results step is unchanged and still runs on if: always(), so SARIF findings reach the Security tab regardless of whether the fail-build step fires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(ci): force Trivy severity filter via step env vars not action inputs

The previous CI run produced 36 Trivy alerts in the Security tab (12 per scan leg) despite the action being configured with severity: CRITICAL,HIGH. Investigation of the alerts revealed:

- All 36 alerts are LOW severity openssl CVEs (CVE-2026-28387 through CVE-2026-31790). Trivy's own "Severity" field on each rule says LOW. Trivy's security_severity_level on each rule says "low". GitHub Code Scanning classifies them as "low" or "medium".
- The same Trivy version (0.69.3) running locally against the same image digests with the same environment variables returned zero findings, correctly filtering LOW-severity CVEs out at the source.
- The trivy-action wrapper uses a set_env_var_if_provided shell helper that writes TRIVY_SEVERITY=CRITICAL,HIGH to a temp file named trivy_envs.txt. Based on the CI log, this file is generated but the severity filter is not being honoured by Trivy at scan time, allowing LOW CVEs to leak through when format=sarif is in use.

Fix: remove the severity and ignore-unfixed inputs from the trivy-action step entirely, and set TRIVY_SEVERITY and TRIVY_IGNORE_UNFIXED as step-level environment variables instead. Trivy reads these directly from its own environment, bypassing trivy-action's wrapper logic, which is the observed point of failure.

This preserves the existing structure (docker pull -> trivy scan -> PowerShell SARIF evaluation -> code scanning upload) and only changes how the filter flags reach Trivy. If step-level env vars still don't take effect, the next escalation is running trivy as a direct shell command via aquasecurity/setup-trivy, bypassing trivy-action entirely.
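The scan step after this change might look like the fragment below, with the filter flags reaching Trivy as step-level environment variables rather than action inputs. Step name and action version are illustrative, and the `cache`/`exit-code` inputs reflect the earlier fixes described above:

```yaml
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master
  env:
    TRIVY_SEVERITY: CRITICAL,HIGH   # read directly by Trivy, bypassing the wrapper
    TRIVY_IGNORE_UNFIXED: "true"    # don't block on CVEs with no upstream fix
  with:
    image-ref: ${{ matrix.image }}
    format: sarif
    output: trivy-results.sarif
    cache: 'false'                  # fresh vuln DB each run
    exit-code: '0'                  # findings are evaluated by a follow-up step
```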
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(ci): filter Trivy findings by CVSS score in PowerShell, not Trivy

Investigation of the previous CI run revealed why we kept getting phantom HIGH findings:

- We queried code scanning for the 36 Trivy alerts on the PR ref and inspected each rule's properties.
- All 36 findings had CVSS security-severity scores in the 2.0-5.5 range (low/medium), with tags ["LOW", "security", "vulnerability"].
- Despite TRIVY_SEVERITY=CRITICAL,HIGH being set both via the trivy-action input AND as a step-level environment variable, Trivy in CI did not filter these LOW-severity findings out of its SARIF output.
- Running the same Trivy version (0.69.3) locally with the exact same env vars correctly filtered them out, returning zero results.

Rather than continue diagnosing why Trivy's severity filter is unreliable when running under trivy-action in CI, switch to a two-stage approach:

1. Trivy scans without a severity filter and writes everything it finds to the SARIF file. This is reliable.
2. The PowerShell evaluation step reads each rule's CVSS score from rule.properties.'security-severity' (the same field GitHub Code Scanning uses to classify alerts), and counts only findings with CVSS >= 7.0 as blocking (HIGH or CRITICAL).

This is strictly better than relying on Trivy's filter:

- It uses CVSS as the source of truth, which is the industry standard for severity.
- It matches how GitHub Code Scanning classifies alerts in the Security tab, so our gate and the Security tab agree.
- It is fully testable locally (verified with both a real Trivy SARIF containing 17 results [14 LOW, 3 MEDIUM, 0 HIGH/CRITICAL] and a synthetic SARIF containing 1 LOW + 1 HIGH + 1 CRITICAL).
- It surfaces a clear severity breakdown in the CI log ("CRITICAL: 0, HIGH: 0, MEDIUM: 3, LOW: 14") and lists each blocking CVE by ID and CVSS when failing.
- It is robust against trivy-action wrapper bugs.

TRIVY_IGNORE_UNFIXED=true is preserved as a step env var so we do not block on CVEs that have no upstream fix yet. For the current state of the digest-pinned base images, this parser correctly classifies every finding as LOW or MEDIUM, so all three scan legs should now pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: document upstream-only base image CVE response procedure

When the scan-base-images CI gate fires on a fixable HIGH/CRITICAL CVE that lives in a Microsoft-published base image layer (e.g., Ubuntu package CVEs in dotnet/runtime:10.0-noble that have an upstream fix but have not yet been absorbed into a refreshed Microsoft image), the JIM project cannot apply the fix directly. The fix has to come from a Microsoft rebuild, which happens on its own cadence.

This commit documents what to do in that situation:

- engineering/DEVELOPER_GUIDE.md gains a new subsection ("When the scan-base-images gate blocks on an upstream-only CVE") under the existing Docker Base Images section. It explains the four available response options in order of preference (wait for the Microsoft rebuild, in-Dockerfile apt-get upgrade, temporary gate threshold downgrade, or alert dismissal in the Security tab) and explicitly forbids continue-on-error: true as a permanent workaround.
- engineering/COMPLIANCE_MAPPING.md gains a new "Operational Considerations" section that briefly describes the situation, acknowledges it as a known limitation of digest-pinned base images, clarifies it is not a compliance gap (digest pinning, scanning, and SBOM generation all still operate correctly), and links readers to the developer guide for the response procedure.

There is no code change in this commit. The operational reality has not changed; the documentation is being added now because the investigation that produced PR #520 surfaced the question and we want the answer captured before it is forgotten.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
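The CVSS-based SARIF gate described in the squash message above can be sketched as follows. The repository's actual step is PowerShell; this Python port is illustrative only, and the function name and return shape are assumptions. It reads each rule's `security-severity` property, the field GitHub Code Scanning itself classifies on:

```python
import json


def blocking_findings(sarif_path: str, threshold: float = 7.0) -> list[tuple[str, float]]:
    """Return (ruleId, CVSS) pairs for SARIF results at or above threshold.

    Reads rule.properties['security-severity'] — the same field GitHub Code
    Scanning uses to classify alerts — so this gate and the Security tab
    agree on what counts as HIGH/CRITICAL.
    """
    with open(sarif_path) as f:
        sarif = json.load(f)
    blocking = []
    for run in sarif.get("runs", []):
        # Index the tool's rules so each result can be joined to its severity.
        rules = {r["id"]: r
                 for r in run.get("tool", {}).get("driver", {}).get("rules", [])}
        for result in run.get("results", []):
            props = rules.get(result.get("ruleId"), {}).get("properties", {})
            score = float(props.get("security-severity", 0.0))
            if score >= threshold:
                blocking.append((result["ruleId"], score))
    return blocking
```

A CI step would fail the build when the returned list is non-empty, printing each blocking CVE by ID and score.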
Closed
Summary
Dependabot's three Docker blocks for JIM.Web, JIM.Worker, and JIM.Scheduler have been silently failing for ~8 weeks. The paths reference the pre-`src/` layout and no longer match where the Dockerfiles live.
Root cause
Dependabot PRs #321 (chore(deps): Bump dotnet/sdk from `a574e62` to `03a7d87` in /JIM.Worker) and #326 (chore(deps): Bump dotnet/sdk from `a574e62` to `03a7d87` in /JIM.Web) were updating base image digests against `/JIM.Web/Dockerfile`, `/JIM.Worker/Dockerfile`, and `/JIM.Scheduler/Dockerfile`.
Fix
Consolidate the three near-identical Docker blocks into a single entry using the multi-directory `directories` key (GitHub, June 2024), pointing at the correct `src/` paths. Also include `.devcontainer/`, which ships a Dockerfile but was never tracked by Dependabot.
```yaml
directories:
  - "/src/JIM.Web"
  - "/src/JIM.Worker"
  - "/src/JIM.Scheduler"
  - "/.devcontainer"
```
All other settings (schedule, reviewers, groups, ignore rules) are preserved unchanged.
What to expect after merge
Test plan