fix(ci): unbreak diagnostics under ansible-core 2.19 and clean up CI noise by Oddly · Pull Request #128 · Oddly/elasticstack

Oddly · 2026-04-13T08:41:48Z

This branch gathers a handful of fixes that came out of investigating the scheduled runs from 13 April. The root causes are independent but the changes are small enough that bundling them felt cleaner than splitting across five PRs.

The elasticsearch_diagnostics scenario was reaching restart_and_verify_elasticsearch.yml via include_tasks with an absolute path, which stopped working in ansible-core 2.19 because role_path no longer leaks across play boundaries from a previous include_role. Switching to include_role with tasks_from restores the role context and drops the env-lookup path construction. I also pinned ansible-core below 2.19 in requirements-test.txt until the other ~50 {{ role_path }}/.. references in production role tasks are audited; the change converted the one we know about but the rest are not covered by the elasticsearch_diagnostics scenario and I want a clean run before re-opening the ceiling.

The collect-diagnostics action was crying wolf on OOMs. It pulled logs from every running container on the Incus host and scanned the shared kernel dmesg ring buffer without any time filter, so each job inherited logs from concurrent matrix siblings and weeks of historical host-level OOMs and the summary lit up red even when the failing run had no memory issue. It now filters containers by MOLECULE_RUN_SUFFIX and the OOM summary drops dmesg and ES log entries whose own timestamps predate a three-hour window. Cgroup memory.events are already per-container and stay as-is.

Dependabot PRs were bouncing off Test Linting because the ansible-galaxy install step referenced secrets.INCUS_HOST, which Dependabot does not inherit; the URL collapsed to http://:8082/... and the install DNS-failed. All four workflows that install from the cache now export CACHE_HOST and fall back to the pinned versions on upstream Galaxy when it is empty. Scheduled and manually-triggered runs keep using the internal cache.

Two cosmetic cleanups on top: a placeholder molecule/default/molecule.yml to silence the CRITICAL glob-failed line molecule 26 emits on every destroy, and a bump of the collect-diagnostics upload step to upload-artifact@v5 ahead of the Node 20 cutoff.

Finally, the logstash_custom_pipeline verify had a race where wait_for only checked for the output file's existence while Logstash was still flushing the generator events through batch; adding search_regex: processed_by makes it wait on actual content. That explains the sporadic fail on rockylinux9 + release 8 specifically, where the runner happened to be slow enough to tip it over.

The rockylinux9 + release 8 timeout on repos_default (converge 71s, idempotence cancelled at the 30m job timeout) is not addressed here. The scenario is minimal enough that the 27-minute idempotence run is almost certainly a host-level stall, but I want a second datapoint before tightening anything in the role. A workflow_dispatch of Test Role repos on this branch will either reproduce or clear it.

Summary by CodeRabbit

New Features
- Optional caching layer for installing Ansible collections; falls back to Galaxy if unavailable.
- Added placeholder Molecule scenario to avoid false missing-scenario reports.
Bug Fixes
- Improved container isolation in concurrent test jobs by suffix-based filtering.
- OOM detection now limits matches to recent timestamps (Elasticsearch and kernel logs).
- Extended Logstash verification wait to check content over a longer timeout.
Chores
- Tightened ansible-core test constraint.
- Upgraded artifact upload action to v5.
- Updated CI scenario exclusion list.

The scenario reached restart_and_verify_elasticsearch.yml via include_tasks with an absolute path, so it ran outside any role context. In 2.19 that stopped working because the included file resolves {{ role_path }} lazily, and role_path no longer leaks across play boundaries from the earlier include_role. Switching both callers to include_role with tasks_from re-establishes the role context and drops the env-lookup path construction along the way.

The OOM report was crying wolf because the collector pulled logs from every running container on the Incus host and the OOM detection itself scanned the shared kernel dmesg ring buffer without a time filter. A single matrix job therefore inherited both its concurrent siblings' logs and weeks of historical host-level OOMs, and the summary lit up red on runs that had no memory issue. Now the container discovery and the cgroup memory report both filter by MOLECULE_RUN_SUFFIX so we only look at containers this job created, and the OOM summary parses the bracketed timestamps on dmesg and ES log lines and drops anything older than three hours.

Dependabot pull requests do not inherit repo secrets, so INCUS_HOST expands to an empty string and the internal collection cache URL collapses to http://:8082/..., which the install step then fails on with a DNS error. Every dependabot PR has been bouncing off Test Linting as a result. Each of the four workflows that install collections from the cache now exports CACHE_HOST via env and switches to installing the same pinned community.general, community.crypto and ansible.posix versions from upstream Galaxy when CACHE_HOST is empty. Scheduled and manually-triggered runs continue to use the internal cache unchanged.

The previous <2.21 ceiling pulled in 2.19.8, which tightened variable scoping so that role_path no longer leaks across play boundaries from an earlier include_role. The elasticsearch_diagnostics scenario tripped over this, and there are roughly fifty other {{ role_path }}/.. references in production role tasks whose behaviour under 2.19 we have not yet validated end-to-end. Capping at <2.19 buys time to convert those to include_role with tasks_from (or equivalent) before letting 2.19 through again.

Molecule 26 globs molecule/default/molecule.yml for shared-state discovery on every invocation and emits a CRITICAL line when the file is missing, which turns up in every molecule destroy step in CI and makes the logs harder to scan when something actually goes wrong. Every workflow passes -s <scenario>, so this placeholder is never executed; it only exists so the glob succeeds.

The custom_pipeline verify task was waiting on the output file path only, but logstash opens the file when the pipeline starts and the generator events still have to flow through batch and flush before anything is written. On slower runners (rockylinux9 + release 8 in particular, where the whole job took 17m against 2-4m for the other forty-one combos) the slurp ran before the first event had been flushed, so the assertion saw an empty file and the scenario failed intermittently. Waiting on search_regex: processed_by removes the race: the task only returns once actual content has been written. I bumped the timeout to 120s since we are now legitimately waiting on pipeline work rather than a path appearing in the filesystem.

v4 runs on Node.js 20, which GitHub Actions is forcing off the runner on 2nd June 2026 and removing entirely on 16th September. v5 uses Node.js 24 and is otherwise API-compatible for our call sites; no workflow changes needed. The deprecation warning shows up in every failed matrix job right now, which is more log noise we can remove ahead of the cutoff.

coderabbitai · 2026-04-13T08:42:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c02bea6b-f515-4621-92ac-4cc3760388b8

📥 Commits

Reviewing files that changed from the base of the PR and between bfb3ef2 and 575ba49.

📒 Files selected for processing (1)

scripts/check-ci-coverage.sh

📝 Walkthrough

Walkthrough

Workflows now conditionally install Ansible collections from a cache host or upstream Galaxy; diagnostics action adds container suffix filtering and time-bounded OOM detection; Molecule playbooks switch task inclusion to role-based lookup and add a placeholder scenario; Ansible Core test constraint tightened; small script and verification timeouts adjusted.

Changes

Cohort / File(s)	Summary
Diagnostics & Container Management `.github/actions/collect-diagnostics/action.yml`	Adds `MOLECULE_RUN_SUFFIX` filtering for discovered Incus containers; OOM checks limit matches to entries within a 3-hour cutoff using timestamp parsing for Elasticsearch and dmesg; upgrades `actions/upload-artifact` from `v4` → `v5`.
Workflow Collection Installation `.github/workflows/molecule.yml`, `.github/workflows/test_elasticsearch_upgrade.yml`, `.github/workflows/test_full_stack.yml`, `.github/workflows/test_linting.yml`	Replaces unconditional tarball installs with a shell-if: set `CACHE_HOST` from `secrets.INCUS_HOST`; if non-empty use `http://${CACHE_HOST}:8082/...` tarballs, else install by Galaxy namespace/version (`community.general:12.3.0`, `community.crypto:3.1.1`, `ansible.posix:2.1.0`).
Molecule Scenarios & Tasks `molecule/default/molecule.yml`, `molecule/elasticsearch_diagnostics/converge.yml`	Adds placeholder Molecule scenario file; switches `include_tasks` (path-based) to `include_role` with `tasks_from` to load `restart_and_verify_elasticsearch.yml`.
Verification Timeout `molecule/logstash_custom_pipeline/verify.yml`	Replaces "file exists" check with content-based wait `search_regex: processed_by`, extends timeout from 60s → 120s, updates failure message.
Dependencies `requirements-test.txt`	Narrows `ansible-core` upper bound from `<2.21` to `<2.19` and adds explanatory comment about role_path/variable-scoping behavior in 2.19.
CI Coverage Script `scripts/check-ci-coverage.sh`	Adjusts `EXCLUDED_SCENARIOS` to exclude `default shared plugins` instead of only `shared plugins`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and specifically describes the main changes: fixing diagnostics compatibility with ansible-core 2.19 and reducing CI noise, which aligns with the core objectives of the pull request.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/ci-diagnostics-cleanup

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

requirements-test.txt (1)
1-6: Add a tracked removal condition for the temporary ansible-core upper bound.

Line 5 says “Revisit,” but there’s no issue/ticket/date anchor. Please add one so this compatibility pin does not become permanent unintentionally.
Suggested tweak
 # (the elasticsearch_diagnostics molecule scenario hit this). Revisit once
-# the full matrix is known to pass on 2.19.
+# the full matrix is known to pass on 2.19.
+# TODO(issue-XXX): remove this upper bound after role_path include-path audit.
 ansible-core>=2.18,<2.19
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements-test.txt` around lines 1 - 6, Update the temporary upper bound
comment for the ansible-core pin in requirements-test.txt (the line with
"ansible-core>=2.18,<2.19") to include a tracked removal condition — add an
issue/ticket ID or a target date and a short rationale so the pin is revisited;
ensure the comment mentions the tracker (e.g., "See ISSUE-XXXX" or "Remove by
2026-07-01 if tests pass") and keeps the existing explanation about role_path
scoping so future maintainers know why the bound exists.
.github/workflows/test_elasticsearch_upgrade.yml (1)
88-104: Use the reusable Molecule workflow instead of copying this bootstrap block.

This branch is now another copy of the collection setup already owned by .github/workflows/molecule.yml, so the next collection pin change will need to be kept in sync across multiple files again. Keep the matrix here if needed, but move the job body behind the reusable workflow.

As per coding guidelines, "Molecule test workflows should use the reusable molecule.yml workflow."

Also applies to: 203-219
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test_elasticsearch_upgrade.yml around lines 88 - 104, The
bootstrap "Install collection" block duplicates collection setup that lives in
the reusable workflow .github/workflows/molecule.yml; replace the inline
installation steps in test_elasticsearch_upgrade.yml with a workflow_call to the
reusable molecule.yml workflow, forwarding the existing matrix and any required
inputs (e.g., collection pins or CACHE_HOST) and environment vars so the
reusable job executes the collection install; remove the duplicated
ansible-galaxy steps (including the if/else CACHE_HOST logic) from this file and
ensure the same change is applied to the other duplicate block referenced (lines
203-219) so all Molecule test workflows use the single reusable molecule.yml.
.github/workflows/test_full_stack.yml (1)
116-132: Route this matrix through .github/workflows/molecule.yml.

This install block is identical to the reusable workflow’s bootstrap logic, which brings back drift every time these collection pins change. Calling the reusable Molecule workflow here would keep one source of truth.

As per coding guidelines, "Molecule test workflows should use the reusable molecule.yml workflow."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test_full_stack.yml around lines 116 - 132, Replace the
duplicated "Install collection" step with a call to the reusable Molecule
workflow (.github/workflows/molecule.yml) so this matrix is routed through that
single source of truth: remove the block that creates ANSIBLE_COLLECTIONS_PATH
and the ansible-galaxy install commands and instead invoke the reusable workflow
(using workflow_call) passing the existing matrix/inputs and secrets (e.g.,
CACHE_HOST) through; ensure the job/step names referencing the matrix remain
compatible with the reusable workflow's input names so molecule.yml performs the
collection installation and bootstrap logic centrally.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/collect-diagnostics/action.yml:
- Around line 224-233: The awk block that computes "recent" from "$dlog" drops
raw dmesg lines like "[12345.678]" because it only handles human-readable
timestamps; update the logic that builds "recent" to also accept raw dmesg
timestamps by obtaining the system boot epoch (e.g., btime from /proc/stat) and
passing it into the awk invocation so the parser can recognize
/^\[[0-9]+\.[0-9]+\]/, extract the seconds-since-boot, add btime to produce an
epoch, compare that to "cutoff", and still keep the existing human-readable
timestamp branch; ensure you reference the same variables used in the diff
("recent", "cutoff", "dlog") when wiring the btime value into awk so kernel OOM
lines aren’t dropped.

In `@molecule/default/molecule.yml`:
- Around line 2-12: Add a minimal Molecule verify playbook for the placeholder
scenario so CI coverage passes: create molecule/default/verify.yml that either
includes/extends the project's shared verify playbook or contains a tiny Ansible
play that runs the standard idempotence/connection checks (e.g., a hosts: all
play with a simple assert or ping task), ensuring it aligns with the existing
convention of reusing the shared verify logic rather than duplicating heavy
flows; reference the placeholder scenario (platform name "placeholder", prerun:
false, driver name "default") when adding the verify.yml so the coverage checker
recognizes and validates this scenario.

---

Nitpick comments:
In @.github/workflows/test_elasticsearch_upgrade.yml:
- Around line 88-104: The bootstrap "Install collection" block duplicates
collection setup that lives in the reusable workflow
.github/workflows/molecule.yml; replace the inline installation steps in
test_elasticsearch_upgrade.yml with a workflow_call to the reusable molecule.yml
workflow, forwarding the existing matrix and any required inputs (e.g.,
collection pins or CACHE_HOST) and environment vars so the reusable job executes
the collection install; remove the duplicated ansible-galaxy steps (including
the if/else CACHE_HOST logic) from this file and ensure the same change is
applied to the other duplicate block referenced (lines 203-219) so all Molecule
test workflows use the single reusable molecule.yml.

In @.github/workflows/test_full_stack.yml:
- Around line 116-132: Replace the duplicated "Install collection" step with a
call to the reusable Molecule workflow (.github/workflows/molecule.yml) so this
matrix is routed through that single source of truth: remove the block that
creates ANSIBLE_COLLECTIONS_PATH and the ansible-galaxy install commands and
instead invoke the reusable workflow (using workflow_call) passing the existing
matrix/inputs and secrets (e.g., CACHE_HOST) through; ensure the job/step names
referencing the matrix remain compatible with the reusable workflow's input
names so molecule.yml performs the collection installation and bootstrap logic
centrally.

In `@requirements-test.txt`:
- Around line 1-6: Update the temporary upper bound comment for the ansible-core
pin in requirements-test.txt (the line with "ansible-core>=2.18,<2.19") to
include a tracked removal condition — add an issue/ticket ID or a target date
and a short rationale so the pin is revisited; ensure the comment mentions the
tracker (e.g., "See ISSUE-XXXX" or "Remove by 2026-07-01 if tests pass") and
keeps the existing explanation about role_path scoping so future maintainers
know why the bound exists.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 165ab32e-cdd9-4dfb-b0f4-3d1aea6a326e

📥 Commits

Reviewing files that changed from the base of the PR and between 3bef767 and bfb3ef2.

📒 Files selected for processing (9)

.github/actions/collect-diagnostics/action.yml
.github/workflows/molecule.yml
.github/workflows/test_elasticsearch_upgrade.yml
.github/workflows/test_full_stack.yml
.github/workflows/test_linting.yml
molecule/default/molecule.yml
molecule/elasticsearch_diagnostics/converge.yml
molecule/logstash_custom_pipeline/verify.yml
requirements-test.txt

coderabbitai · 2026-04-13T08:47:07Z

+            recent=$(awk -v cutoff="$since_epoch" '
+              { line = tolower($0) }
+              line ~ /oom-killer|out of memory|killed process/ {
+                if (match($0, /^\[[^]]+\]/)) {
+                  ts = substr($0, RSTART+1, RLENGTH-2)
+                  cmd = "date -d \"" ts "\" +%s 2>/dev/null"
+                  if ((cmd | getline epoch) > 0 && (epoch+0) >= (cutoff+0)) print
+                  close(cmd)
+                }
+              }' "$dlog")


⚠️ Potential issue | 🟠 Major

Handle plain dmesg output in the new cutoff parser.

Lines 55-57 still fall back to raw dmesg, but this awk block only accepts human-readable timestamps that date -d can parse. On hosts where dmesg -T is unavailable, every raw [12345.678] OOM line is dropped here, so kernel OOMs disappear from the report. Please either capture a parseable timestamp at collection time or record enough host-boot metadata to translate raw seconds before applying the cutoff.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/actions/collect-diagnostics/action.yml around lines 224 - 233, The awk block that computes "recent" from "$dlog" drops raw dmesg lines like "[12345.678]" because it only handles human-readable timestamps; update the logic that builds "recent" to also accept raw dmesg timestamps by obtaining the system boot epoch (e.g., btime from /proc/stat) and passing it into the awk invocation so the parser can recognize /^\[[0-9]+\.[0-9]+\]/, extract the seconds-since-boot, add btime to produce an epoch, compare that to "cutoff", and still keep the existing human-readable timestamp branch; ensure you reference the same variables used in the diff ("recent", "cutoff", "dlog") when wiring the btime value into awk so kernel OOM lines aren’t dropped.

coderabbitai · 2026-04-13T08:47:07Z

+# Placeholder scenario. Molecule 26 globs molecule/default/molecule.yml on
+# every run to discover shared state, and emits a CRITICAL line when it is
+# absent. Every CI invocation passes -s <scenario>, so this scenario is
+# never executed; the file only exists to silence that log noise.
+prerun: false
+driver:
+  name: default
+platforms:
+  - name: placeholder
+provisioner:
+  name: ansible


⚠️ Potential issue | 🟠 Major

molecule/default is incomplete and currently breaks CI coverage checks.

The run is failing because this new scenario has no verify.yml. Please add a minimal molecule/default/verify.yml (or explicitly exclude this placeholder scenario from the coverage checker).

Minimal fix (new file)

+++ b/molecule/default/verify.yml +--- +- name: Placeholder verify + hosts: all + gather_facts: false + tasks: + - name: Placeholder assertion + ansible.builtin.assert: + that: + - true

As per coding guidelines, "molecule/**: Verify molecule scenarios follow project conventions: prefer extending existing verify.yml over creating new scenarios (each adds ~10 min CI). Check that verify assertions are meaningful and will catch regressions."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Placeholder scenario. Molecule 26 globs molecule/default/molecule.yml on

# every run to discover shared state, and emits a CRITICAL line when it is

# absent. Every CI invocation passes -s <scenario>, so this scenario is

# never executed; the file only exists to silence that log noise.

prerun: false

driver:

name: default

platforms:

- name: placeholder

provisioner:

name: ansible

---

- name: Placeholder verify

hosts: all

gather_facts: false

tasks:

- name: Placeholder assertion

ansible.builtin.assert:

that:

- true

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@molecule/default/molecule.yml` around lines 2 - 12, Add a minimal Molecule verify playbook for the placeholder scenario so CI coverage passes: create molecule/default/verify.yml that either includes/extends the project's shared verify playbook or contains a tiny Ansible play that runs the standard idempotence/connection checks (e.g., a hosts: all play with a simple assert or ping task), ensuring it aligns with the existing convention of reusing the shared verify logic rather than duplicating heavy flows; reference the placeholder scenario (platform name "placeholder", prerun: false, driver name "default") when adding the verify.yml so the coverage checker recognizes and validates this scenario.

Oddly added 7 commits April 13, 2026 10:25

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

fix(ci): exclude placeholder molecule scenario from coverage

575ba49

Oddly merged commit 0e190b2 into main Apr 13, 2026
128 checks passed

Oddly deleted the fix/ci-diagnostics-cleanup branch April 13, 2026 12:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci): unbreak diagnostics under ansible-core 2.19 and clean up CI noise#128

fix(ci): unbreak diagnostics under ansible-core 2.19 and clean up CI noise#128
Oddly merged 8 commits intomainfrom
fix/ci-diagnostics-cleanup

Oddly commented Apr 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Uh oh!

coderabbitai Bot Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Oddly commented Apr 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Oddly commented Apr 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading