Skip to content

fix(ci): unbreak diagnostics under ansible-core 2.19 and clean up CI noise#128

Merged
Oddly merged 8 commits intomainfrom
fix/ci-diagnostics-cleanup
Apr 13, 2026
Merged

fix(ci): unbreak diagnostics under ansible-core 2.19 and clean up CI noise#128
Oddly merged 8 commits intomainfrom
fix/ci-diagnostics-cleanup

Conversation

@Oddly
Copy link
Copy Markdown
Owner

@Oddly Oddly commented Apr 13, 2026

This branch gathers a handful of fixes that came out of investigating the scheduled runs from 13 April. The root causes are independent but the changes are small enough that bundling them felt cleaner than splitting across five PRs.

The elasticsearch_diagnostics scenario was reaching restart_and_verify_elasticsearch.yml via include_tasks with an absolute path, which stopped working in ansible-core 2.19 because role_path no longer leaks across play boundaries from a previous include_role. Switching to include_role with tasks_from restores the role context and drops the env-lookup path construction. I also pinned ansible-core below 2.19 in requirements-test.txt until the other ~50 {{ role_path }}/.. references in production role tasks are audited; the change converted the one we know about but the rest are not covered by the elasticsearch_diagnostics scenario and I want a clean run before re-opening the ceiling.

The collect-diagnostics action was crying wolf on OOMs. It pulled logs from every running container on the Incus host and scanned the shared kernel dmesg ring buffer without any time filter, so each job inherited logs from concurrent matrix siblings and weeks of historical host-level OOMs and the summary lit up red even when the failing run had no memory issue. It now filters containers by MOLECULE_RUN_SUFFIX and the OOM summary drops dmesg and ES log entries whose own timestamps predate a three-hour window. Cgroup memory.events are already per-container and stay as-is.

Dependabot PRs were bouncing off Test Linting because the ansible-galaxy install step referenced secrets.INCUS_HOST, which Dependabot does not inherit; the URL collapsed to http://:8082/... and the install DNS-failed. All four workflows that install from the cache now export CACHE_HOST and fall back to the pinned versions on upstream Galaxy when it is empty. Scheduled and manually-triggered runs keep using the internal cache.

Two cosmetic cleanups on top: a placeholder molecule/default/molecule.yml to silence the CRITICAL glob-failed line molecule 26 emits on every destroy, and a bump of the collect-diagnostics upload step to upload-artifact@v5 ahead of the Node 20 cutoff.

Finally, the logstash_custom_pipeline verify had a race where wait_for only checked for the output file's existence while Logstash was still flushing the generator events through batch; adding search_regex: processed_by makes it wait on actual content. That explains the sporadic fail on rockylinux9 + release 8 specifically, where the runner happened to be slow enough to tip it over.

The rockylinux9 + release 8 timeout on repos_default (converge 71s, idempotence cancelled at the 30m job timeout) is not addressed here. The scenario is minimal enough that the 27-minute idempotence run is almost certainly a host-level stall, but I want a second datapoint before tightening anything in the role. A workflow_dispatch of Test Role repos on this branch will either reproduce or clear it.

Summary by CodeRabbit

  • New Features

    • Optional caching layer for installing Ansible collections; falls back to Galaxy if unavailable.
    • Added placeholder Molecule scenario to avoid false missing-scenario reports.
  • Bug Fixes

    • Improved container isolation in concurrent test jobs by suffix-based filtering.
    • OOM detection now limits matches to recent timestamps (Elasticsearch and kernel logs).
    • Extended Logstash verification wait to check content over a longer timeout.
  • Chores

    • Tightened ansible-core test constraint.
    • Upgraded artifact upload action to v5.
    • Updated CI scenario exclusion list.

Oddly added 7 commits April 13, 2026 10:25
The scenario reached restart_and_verify_elasticsearch.yml via include_tasks with
an absolute path, so it ran outside any role context. In 2.19 that stopped
working because the included file resolves {{ role_path }} lazily, and role_path
no longer leaks across play boundaries from the earlier include_role. Switching
both callers to include_role with tasks_from re-establishes the role context
and drops the env-lookup path construction along the way.
The OOM report was crying wolf because the collector pulled logs from every
running container on the Incus host and the OOM detection itself scanned the
shared kernel dmesg ring buffer without a time filter. A single matrix job
therefore inherited both its concurrent siblings' logs and weeks of historical
host-level OOMs, and the summary lit up red on runs that had no memory issue.

Now the container discovery and the cgroup memory report both filter by
MOLECULE_RUN_SUFFIX so we only look at containers this job created, and the
OOM summary parses the bracketed timestamps on dmesg and ES log lines and
drops anything older than three hours.
Dependabot pull requests do not inherit repo secrets, so INCUS_HOST expands
to an empty string and the internal collection cache URL collapses to
http://:8082/..., which the install step then fails on with a DNS error.
Every dependabot PR has been bouncing off Test Linting as a result.

Each of the four workflows that install collections from the cache now
exports CACHE_HOST via env and switches to installing the same pinned
community.general, community.crypto and ansible.posix versions from
upstream Galaxy when CACHE_HOST is empty. Scheduled and manually-triggered
runs continue to use the internal cache unchanged.
The previous <2.21 ceiling pulled in 2.19.8, which tightened variable
scoping so that role_path no longer leaks across play boundaries from an
earlier include_role. The elasticsearch_diagnostics scenario tripped over
this, and there are roughly fifty other {{ role_path }}/.. references in
production role tasks whose behaviour under 2.19 we have not yet validated
end-to-end. Capping at <2.19 buys time to convert those to include_role
with tasks_from (or equivalent) before letting 2.19 through again.
Molecule 26 globs molecule/default/molecule.yml for shared-state discovery
on every invocation and emits a CRITICAL line when the file is missing,
which turns up in every molecule destroy step in CI and makes the logs
harder to scan when something actually goes wrong. Every workflow passes
-s <scenario>, so this placeholder is never executed; it only exists so
the glob succeeds.
The custom_pipeline verify task was waiting on the output file path only, but
logstash opens the file when the pipeline starts and the generator events
still have to flow through batch and flush before anything is written. On
slower runners (rockylinux9 + release 8 in particular, where the whole job
took 17m against 2-4m for the other forty-one combos) the slurp ran before
the first event had been flushed, so the assertion saw an empty file and the
scenario failed intermittently.

Waiting on search_regex: processed_by removes the race: the task only
returns once actual content has been written. I bumped the timeout to 120s
since we are now legitimately waiting on pipeline work rather than a path
appearing in the filesystem.
v4 runs on Node.js 20, which GitHub Actions is forcing off the runner on
2nd June 2026 and removing entirely on 16th September. v5 uses Node.js 24
and is otherwise API-compatible for our call sites; no workflow changes
needed. The deprecation warning shows up in every failed matrix job right
now, which is more log noise we can remove ahead of the cutoff.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c02bea6b-f515-4621-92ac-4cc3760388b8

📥 Commits

Reviewing files that changed from the base of the PR and between bfb3ef2 and 575ba49.

📒 Files selected for processing (1)
  • scripts/check-ci-coverage.sh

📝 Walkthrough

Walkthrough

Workflows now conditionally install Ansible collections from a cache host or upstream Galaxy; diagnostics action adds container suffix filtering and time-bounded OOM detection; Molecule playbooks switch task inclusion to role-based lookup and add a placeholder scenario; Ansible Core test constraint tightened; small script and verification timeouts adjusted.

Changes

Cohort / File(s) Summary
Diagnostics & Container Management
.github/actions/collect-diagnostics/action.yml
Adds MOLECULE_RUN_SUFFIX filtering for discovered Incus containers; OOM checks limit matches to entries within a 3-hour cutoff using timestamp parsing for Elasticsearch and dmesg; upgrades actions/upload-artifact from v4v5.
Workflow Collection Installation
.github/workflows/molecule.yml, .github/workflows/test_elasticsearch_upgrade.yml, .github/workflows/test_full_stack.yml, .github/workflows/test_linting.yml
Replaces unconditional tarball installs with a shell-if: set CACHE_HOST from secrets.INCUS_HOST; if non-empty use http://${CACHE_HOST}:8082/... tarballs, else install by Galaxy namespace/version (community.general:12.3.0, community.crypto:3.1.1, ansible.posix:2.1.0).
Molecule Scenarios & Tasks
molecule/default/molecule.yml, molecule/elasticsearch_diagnostics/converge.yml
Adds placeholder Molecule scenario file; switches include_tasks (path-based) to include_role with tasks_from to load restart_and_verify_elasticsearch.yml.
Verification Timeout
molecule/logstash_custom_pipeline/verify.yml
Replaces "file exists" check with content-based wait search_regex: processed_by, extends timeout from 60s → 120s, updates failure message.
Dependencies
requirements-test.txt
Narrows ansible-core upper bound from <2.21 to <2.19 and adds explanatory comment about role_path/variable-scoping behavior in 2.19.
CI Coverage Script
scripts/check-ci-coverage.sh
Adjusts EXCLUDED_SCENARIOS to exclude default shared plugins instead of only shared plugins.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the main changes: fixing diagnostics compatibility with ansible-core 2.19 and reducing CI noise, which aligns with the core objectives of the pull request.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/ci-diagnostics-cleanup

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
requirements-test.txt (1)

1-6: Add a tracked removal condition for the temporary ansible-core upper bound.

Line 5 says “Revisit,” but there’s no issue/ticket/date anchor. Please add one so this compatibility pin does not become permanent unintentionally.

Suggested tweak
 # (the elasticsearch_diagnostics molecule scenario hit this). Revisit once
-# the full matrix is known to pass on 2.19.
+# the full matrix is known to pass on 2.19.
+# TODO(issue-XXX): remove this upper bound after role_path include-path audit.
 ansible-core>=2.18,<2.19
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements-test.txt` around lines 1 - 6, Update the temporary upper bound
comment for the ansible-core pin in requirements-test.txt (the line with
"ansible-core>=2.18,<2.19") to include a tracked removal condition — add an
issue/ticket ID or a target date and a short rationale so the pin is revisited;
ensure the comment mentions the tracker (e.g., "See ISSUE-XXXX" or "Remove by
2026-07-01 if tests pass") and keeps the existing explanation about role_path
scoping so future maintainers know why the bound exists.
.github/workflows/test_elasticsearch_upgrade.yml (1)

88-104: Use the reusable Molecule workflow instead of copying this bootstrap block.

This branch is now another copy of the collection setup already owned by .github/workflows/molecule.yml, so the next collection pin change will need to be kept in sync across multiple files again. Keep the matrix here if needed, but move the job body behind the reusable workflow.

As per coding guidelines, "Molecule test workflows should use the reusable molecule.yml workflow."

Also applies to: 203-219

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test_elasticsearch_upgrade.yml around lines 88 - 104, The
bootstrap "Install collection" block duplicates collection setup that lives in
the reusable workflow .github/workflows/molecule.yml; replace the inline
installation steps in test_elasticsearch_upgrade.yml with a workflow_call to the
reusable molecule.yml workflow, forwarding the existing matrix and any required
inputs (e.g., collection pins or CACHE_HOST) and environment vars so the
reusable job executes the collection install; remove the duplicated
ansible-galaxy steps (including the if/else CACHE_HOST logic) from this file and
ensure the same change is applied to the other duplicate block referenced (lines
203-219) so all Molecule test workflows use the single reusable molecule.yml.
.github/workflows/test_full_stack.yml (1)

116-132: Route this matrix through .github/workflows/molecule.yml.

This install block is identical to the reusable workflow’s bootstrap logic, which brings back drift every time these collection pins change. Calling the reusable Molecule workflow here would keep one source of truth.

As per coding guidelines, "Molecule test workflows should use the reusable molecule.yml workflow."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test_full_stack.yml around lines 116 - 132, Replace the
duplicated "Install collection" step with a call to the reusable Molecule
workflow (.github/workflows/molecule.yml) so this matrix is routed through that
single source of truth: remove the block that creates ANSIBLE_COLLECTIONS_PATH
and the ansible-galaxy install commands and instead invoke the reusable workflow
(using workflow_call) passing the existing matrix/inputs and secrets (e.g.,
CACHE_HOST) through; ensure the job/step names referencing the matrix remain
compatible with the reusable workflow's input names so molecule.yml performs the
collection installation and bootstrap logic centrally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/collect-diagnostics/action.yml:
- Around line 224-233: The awk block that computes "recent" from "$dlog" drops
raw dmesg lines like "[12345.678]" because it only handles human-readable
timestamps; update the logic that builds "recent" to also accept raw dmesg
timestamps by obtaining the system boot epoch (e.g., btime from /proc/stat) and
passing it into the awk invocation so the parser can recognize
/^\[[0-9]+\.[0-9]+\]/, extract the seconds-since-boot, add btime to produce an
epoch, compare that to "cutoff", and still keep the existing human-readable
timestamp branch; ensure you reference the same variables used in the diff
("recent", "cutoff", "dlog") when wiring the btime value into awk so kernel OOM
lines aren’t dropped.

In `@molecule/default/molecule.yml`:
- Around line 2-12: Add a minimal Molecule verify playbook for the placeholder
scenario so CI coverage passes: create molecule/default/verify.yml that either
includes/extends the project's shared verify playbook or contains a tiny Ansible
play that runs the standard idempotence/connection checks (e.g., a hosts: all
play with a simple assert or ping task), ensuring it aligns with the existing
convention of reusing the shared verify logic rather than duplicating heavy
flows; reference the placeholder scenario (platform name "placeholder", prerun:
false, driver name "default") when adding the verify.yml so the coverage checker
recognizes and validates this scenario.

---

Nitpick comments:
In @.github/workflows/test_elasticsearch_upgrade.yml:
- Around line 88-104: The bootstrap "Install collection" block duplicates
collection setup that lives in the reusable workflow
.github/workflows/molecule.yml; replace the inline installation steps in
test_elasticsearch_upgrade.yml with a workflow_call to the reusable molecule.yml
workflow, forwarding the existing matrix and any required inputs (e.g.,
collection pins or CACHE_HOST) and environment vars so the reusable job executes
the collection install; remove the duplicated ansible-galaxy steps (including
the if/else CACHE_HOST logic) from this file and ensure the same change is
applied to the other duplicate block referenced (lines 203-219) so all Molecule
test workflows use the single reusable molecule.yml.

In @.github/workflows/test_full_stack.yml:
- Around line 116-132: Replace the duplicated "Install collection" step with a
call to the reusable Molecule workflow (.github/workflows/molecule.yml) so this
matrix is routed through that single source of truth: remove the block that
creates ANSIBLE_COLLECTIONS_PATH and the ansible-galaxy install commands and
instead invoke the reusable workflow (using workflow_call) passing the existing
matrix/inputs and secrets (e.g., CACHE_HOST) through; ensure the job/step names
referencing the matrix remain compatible with the reusable workflow's input
names so molecule.yml performs the collection installation and bootstrap logic
centrally.

In `@requirements-test.txt`:
- Around line 1-6: Update the temporary upper bound comment for the ansible-core
pin in requirements-test.txt (the line with "ansible-core>=2.18,<2.19") to
include a tracked removal condition — add an issue/ticket ID or a target date
and a short rationale so the pin is revisited; ensure the comment mentions the
tracker (e.g., "See ISSUE-XXXX" or "Remove by 2026-07-01 if tests pass") and
keeps the existing explanation about role_path scoping so future maintainers
know why the bound exists.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 165ab32e-cdd9-4dfb-b0f4-3d1aea6a326e

📥 Commits

Reviewing files that changed from the base of the PR and between 3bef767 and bfb3ef2.

📒 Files selected for processing (9)
  • .github/actions/collect-diagnostics/action.yml
  • .github/workflows/molecule.yml
  • .github/workflows/test_elasticsearch_upgrade.yml
  • .github/workflows/test_full_stack.yml
  • .github/workflows/test_linting.yml
  • molecule/default/molecule.yml
  • molecule/elasticsearch_diagnostics/converge.yml
  • molecule/logstash_custom_pipeline/verify.yml
  • requirements-test.txt

Comment on lines +224 to +233
recent=$(awk -v cutoff="$since_epoch" '
{ line = tolower($0) }
line ~ /oom-killer|out of memory|killed process/ {
if (match($0, /^\[[^]]+\]/)) {
ts = substr($0, RSTART+1, RLENGTH-2)
cmd = "date -d \"" ts "\" +%s 2>/dev/null"
if ((cmd | getline epoch) > 0 && (epoch+0) >= (cutoff+0)) print
close(cmd)
}
}' "$dlog")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle plain dmesg output in the new cutoff parser.

Lines 55-57 still fall back to raw dmesg, but this awk block only accepts human-readable timestamps that date -d can parse. On hosts where dmesg -T is unavailable, every raw [12345.678] OOM line is dropped here, so kernel OOMs disappear from the report. Please either capture a parseable timestamp at collection time or record enough host-boot metadata to translate raw seconds before applying the cutoff.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/actions/collect-diagnostics/action.yml around lines 224 - 233, The
awk block that computes "recent" from "$dlog" drops raw dmesg lines like
"[12345.678]" because it only handles human-readable timestamps; update the
logic that builds "recent" to also accept raw dmesg timestamps by obtaining the
system boot epoch (e.g., btime from /proc/stat) and passing it into the awk
invocation so the parser can recognize /^\[[0-9]+\.[0-9]+\]/, extract the
seconds-since-boot, add btime to produce an epoch, compare that to "cutoff", and
still keep the existing human-readable timestamp branch; ensure you reference
the same variables used in the diff ("recent", "cutoff", "dlog") when wiring the
btime value into awk so kernel OOM lines aren’t dropped.

Comment on lines +2 to +12
# Placeholder scenario. Molecule 26 globs molecule/default/molecule.yml on
# every run to discover shared state, and emits a CRITICAL line when it is
# absent. Every CI invocation passes -s <scenario>, so this scenario is
# never executed; the file only exists to silence that log noise.
prerun: false
driver:
name: default
platforms:
- name: placeholder
provisioner:
name: ansible
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

molecule/default is incomplete and currently breaks CI coverage checks.

The run is failing because this new scenario has no verify.yml. Please add a minimal molecule/default/verify.yml (or explicitly exclude this placeholder scenario from the coverage checker).

Minimal fix (new file)
+++ b/molecule/default/verify.yml
+---
+- name: Placeholder verify
+  hosts: all
+  gather_facts: false
+  tasks:
+    - name: Placeholder assertion
+      ansible.builtin.assert:
+        that:
+          - true

As per coding guidelines, "molecule/**: Verify molecule scenarios follow project conventions: prefer extending existing verify.yml over creating new scenarios (each adds ~10 min CI). Check that verify assertions are meaningful and will catch regressions."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Placeholder scenario. Molecule 26 globs molecule/default/molecule.yml on
# every run to discover shared state, and emits a CRITICAL line when it is
# absent. Every CI invocation passes -s <scenario>, so this scenario is
# never executed; the file only exists to silence that log noise.
prerun: false
driver:
name: default
platforms:
- name: placeholder
provisioner:
name: ansible
---
- name: Placeholder verify
hosts: all
gather_facts: false
tasks:
- name: Placeholder assertion
ansible.builtin.assert:
that:
- true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@molecule/default/molecule.yml` around lines 2 - 12, Add a minimal Molecule
verify playbook for the placeholder scenario so CI coverage passes: create
molecule/default/verify.yml that either includes/extends the project's shared
verify playbook or contains a tiny Ansible play that runs the standard
idempotence/connection checks (e.g., a hosts: all play with a simple assert or
ping task), ensuring it aligns with the existing convention of reusing the
shared verify logic rather than duplicating heavy flows; reference the
placeholder scenario (platform name "placeholder", prerun: false, driver name
"default") when adding the verify.yml so the coverage checker recognizes and
validates this scenario.

@Oddly Oddly merged commit 0e190b2 into main Apr 13, 2026
128 checks passed
@Oddly Oddly deleted the fix/ci-diagnostics-cleanup branch April 13, 2026 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant