fix(ci): unbreak diagnostics under ansible-core 2.19 and clean up CI noise #128
Conversation
The scenario reached restart_and_verify_elasticsearch.yml via include_tasks with
an absolute path, so it ran outside any role context. In 2.19 that stopped
working because the included file resolves {{ role_path }} lazily, and role_path
no longer leaks across play boundaries from the earlier include_role. Switching
both callers to include_role with tasks_from re-establishes the role context
and drops the env-lookup path construction along the way.
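A minimal sketch of the switch, assuming a hypothetical role name and env lookup (the scenario's actual names may differ):

```yaml
# Before: absolute-path include. Under ansible-core 2.19 the included file
# resolves {{ role_path }} lazily and no longer inherits it from the earlier
# include_role, so the path lookup breaks.
- name: Restart and verify Elasticsearch
  ansible.builtin.include_tasks: >-
    {{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') }}/roles/elasticsearch/tasks/restart_and_verify_elasticsearch.yml

# After: include_role re-establishes the role context, so role_path resolves
# inside the included file without any path construction.
- name: Restart and verify Elasticsearch
  ansible.builtin.include_role:
    name: elasticsearch
    tasks_from: restart_and_verify_elasticsearch.yml
```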
The OOM report was crying wolf because the collector pulled logs from every running container on the Incus host and the OOM detection itself scanned the shared kernel dmesg ring buffer without a time filter. A single matrix job therefore inherited both its concurrent siblings' logs and weeks of historical host-level OOMs, and the summary lit up red on runs that had no memory issue. Now the container discovery and the cgroup memory report both filter by MOLECULE_RUN_SUFFIX so we only look at containers this job created, and the OOM summary parses the bracketed timestamps on dmesg and ES log lines and drops anything older than three hours.
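A rough shape of the suffix filter, written as a composite-action step; the incus invocation and log path are illustrative, not the action's exact code:

```yaml
- name: Collect logs only from containers this job created
  shell: bash
  run: |
    # Container names carry the per-job suffix, so match on it and skip
    # anything spawned by concurrent matrix siblings on the same host.
    incus list --format csv --columns n | while read -r name; do
      case "$name" in
        *"${MOLECULE_RUN_SUFFIX}")
          incus exec "$name" -- journalctl --no-pager -n 500 \
            > "diagnostics/${name}.log" 2>/dev/null || true
          ;;
      esac
    done
```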
Dependabot pull requests do not inherit repo secrets, so INCUS_HOST expands to an empty string and the internal collection cache URL collapses to http://:8082/..., which the install step then fails on with a DNS error. Every dependabot PR has been bouncing off Test Linting as a result. Each of the four workflows that install collections from the cache now exports CACHE_HOST via env and switches to installing the same pinned community.general, community.crypto and ansible.posix versions from upstream Galaxy when CACHE_HOST is empty. Scheduled and manually-triggered runs continue to use the internal cache unchanged.
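The fallback pattern, sketched with placeholder version pins and a placeholder cache path (the PR text elides both):

```yaml
env:
  CACHE_HOST: ${{ secrets.INCUS_HOST }}

steps:
  - name: Install collections
    run: |
      if [ -n "$CACHE_HOST" ]; then
        # Scheduled and manually-triggered runs: internal cache.
        # The path after the port is elided above; placeholder here.
        ansible-galaxy collection install -r requirements.yml \
          -s "http://${CACHE_HOST}:8082/<cache-path>"
      else
        # Dependabot PRs get no repo secrets: same pins, upstream Galaxy.
        # Versions below are placeholders, not the real pins.
        ansible-galaxy collection install \
          community.general:9.0.0 community.crypto:2.0.0 ansible.posix:1.6.0
      fi
```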
The previous <2.21 ceiling pulled in 2.19.8, which tightened variable
scoping so that role_path no longer leaks across play boundaries from an
earlier include_role. The elasticsearch_diagnostics scenario tripped over
this, and there are roughly fifty other {{ role_path }}/.. references in
production role tasks whose behaviour under 2.19 we have not yet validated
end-to-end. Capping at <2.19 buys time to convert those to include_role
with tasks_from (or equivalent) before letting 2.19 through again.
Molecule 26 globs molecule/default/molecule.yml for shared-state discovery on every invocation and emits a CRITICAL line when the file is missing, which turns up in every molecule destroy step in CI and makes the logs harder to scan when something actually goes wrong. Every workflow passes -s <scenario>, so this placeholder is never executed; it only exists so the glob succeeds.
The custom_pipeline verify task was waiting on the output file path only, but logstash opens the file when the pipeline starts and the generator events still have to flow through batch and flush before anything is written. On slower runners (rockylinux9 + release 8 in particular, where the whole job took 17m against 2-4m for the other forty-one combos) the slurp ran before the first event had been flushed, so the assertion saw an empty file and the scenario failed intermittently. Waiting on search_regex: processed_by removes the race: the task only returns once actual content has been written. I bumped the timeout to 120s since we are now legitimately waiting on pipeline work rather than a path appearing in the filesystem.
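The resulting verify task is essentially this, with an illustrative output path:

```yaml
- name: Wait for the pipeline to emit real events, not just open the file
  ansible.builtin.wait_for:
    path: /tmp/custom_pipeline_output.json   # illustrative; real path differs
    search_regex: processed_by
    timeout: 120
```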
v4 runs on Node.js 20, which GitHub Actions is forcing off the runner on 2nd June 2026 and removing entirely on 16th September. v5 uses Node.js 24 and is otherwise API-compatible for our call sites; no workflow changes needed. The deprecation warning shows up in every failed matrix job right now, which is more log noise we can remove ahead of the cutoff.
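The bump itself is a one-line change per call site, roughly:

```yaml
- name: Upload diagnostics bundle
  uses: actions/upload-artifact@v5   # was @v4 (Node 20, deprecated)
  with:
    name: diagnostics-${{ matrix.scenario }}   # illustrative inputs
    path: diagnostics/
```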
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough

Workflows now conditionally install Ansible collections from a cache host or upstream Galaxy; the diagnostics action adds container suffix filtering and time-bounded OOM detection; Molecule playbooks switch task inclusion to role-based lookup and add a placeholder scenario; the Ansible Core test constraint is tightened; small script and verification timeouts are adjusted.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
requirements-test.txt (1)
1-6: Add a tracked removal condition for the temporary ansible-core upper bound.

Line 5 says “Revisit,” but there’s no issue/ticket/date anchor. Please add one so this compatibility pin does not become permanent unintentionally.
Suggested tweak

```diff
 # (the elasticsearch_diagnostics molecule scenario hit this). Revisit once
-# the full matrix is known to pass on 2.19.
+# the full matrix is known to pass on 2.19.
+# TODO(issue-XXX): remove this upper bound after role_path include-path audit.
 ansible-core>=2.18,<2.19
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements-test.txt` around lines 1 - 6, Update the temporary upper bound comment for the ansible-core pin in requirements-test.txt (the line with "ansible-core>=2.18,<2.19") to include a tracked removal condition — add an issue/ticket ID or a target date and a short rationale so the pin is revisited; ensure the comment mentions the tracker (e.g., "See ISSUE-XXXX" or "Remove by 2026-07-01 if tests pass") and keeps the existing explanation about role_path scoping so future maintainers know why the bound exists.

.github/workflows/test_elasticsearch_upgrade.yml (1)
88-104: Use the reusable Molecule workflow instead of copying this bootstrap block.

This branch is now another copy of the collection setup already owned by .github/workflows/molecule.yml, so the next collection pin change will need to be kept in sync across multiple files again. Keep the matrix here if needed, but move the job body behind the reusable workflow.

As per coding guidelines, "Molecule test workflows should use the reusable molecule.yml workflow."
Also applies to: 203-219
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test_elasticsearch_upgrade.yml around lines 88 - 104, The bootstrap "Install collection" block duplicates collection setup that lives in the reusable workflow .github/workflows/molecule.yml; replace the inline installation steps in test_elasticsearch_upgrade.yml with a workflow_call to the reusable molecule.yml workflow, forwarding the existing matrix and any required inputs (e.g., collection pins or CACHE_HOST) and environment vars so the reusable job executes the collection install; remove the duplicated ansible-galaxy steps (including the if/else CACHE_HOST logic) from this file and ensure the same change is applied to the other duplicate block referenced (lines 203-219) so all Molecule test workflows use the single reusable molecule.yml.

.github/workflows/test_full_stack.yml (1)
116-132: Route this matrix through .github/workflows/molecule.yml.

This install block is identical to the reusable workflow’s bootstrap logic, which brings back drift every time these collection pins change. Calling the reusable Molecule workflow here would keep one source of truth.
As per coding guidelines, "Molecule test workflows should use the reusable molecule.yml workflow."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test_full_stack.yml around lines 116 - 132, Replace the duplicated "Install collection" step with a call to the reusable Molecule workflow (.github/workflows/molecule.yml) so this matrix is routed through that single source of truth: remove the block that creates ANSIBLE_COLLECTIONS_PATH and the ansible-galaxy install commands and instead invoke the reusable workflow (using workflow_call) passing the existing matrix/inputs and secrets (e.g., CACHE_HOST) through; ensure the job/step names referencing the matrix remain compatible with the reusable workflow's input names so molecule.yml performs the collection installation and bootstrap logic centrally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/collect-diagnostics/action.yml:
- Around line 224-233: The awk block that computes "recent" from "$dlog" drops
raw dmesg lines like "[12345.678]" because it only handles human-readable
timestamps; update the logic that builds "recent" to also accept raw dmesg
timestamps by obtaining the system boot epoch (e.g., btime from /proc/stat) and
passing it into the awk invocation so the parser can recognize
/^\[[0-9]+\.[0-9]+\]/, extract the seconds-since-boot, add btime to produce an
epoch, compare that to "cutoff", and still keep the existing human-readable
timestamp branch; ensure you reference the same variables used in the diff
("recent", "cutoff", "dlog") when wiring the btime value into awk so kernel OOM
lines aren’t dropped.
In `@molecule/default/molecule.yml`:
- Around line 2-12: Add a minimal Molecule verify playbook for the placeholder
scenario so CI coverage passes: create molecule/default/verify.yml that either
includes/extends the project's shared verify playbook or contains a tiny Ansible
play that runs the standard idempotence/connection checks (e.g., a hosts: all
play with a simple assert or ping task), ensuring it aligns with the existing
convention of reusing the shared verify logic rather than duplicating heavy
flows; reference the placeholder scenario (platform name "placeholder", prerun:
false, driver name "default") when adding the verify.yml so the coverage checker
recognizes and validates this scenario.
---
Nitpick comments:
In @.github/workflows/test_elasticsearch_upgrade.yml:
- Around line 88-104: The bootstrap "Install collection" block duplicates
collection setup that lives in the reusable workflow
.github/workflows/molecule.yml; replace the inline installation steps in
test_elasticsearch_upgrade.yml with a workflow_call to the reusable molecule.yml
workflow, forwarding the existing matrix and any required inputs (e.g.,
collection pins or CACHE_HOST) and environment vars so the reusable job executes
the collection install; remove the duplicated ansible-galaxy steps (including
the if/else CACHE_HOST logic) from this file and ensure the same change is
applied to the other duplicate block referenced (lines 203-219) so all Molecule
test workflows use the single reusable molecule.yml.
In @.github/workflows/test_full_stack.yml:
- Around line 116-132: Replace the duplicated "Install collection" step with a
call to the reusable Molecule workflow (.github/workflows/molecule.yml) so this
matrix is routed through that single source of truth: remove the block that
creates ANSIBLE_COLLECTIONS_PATH and the ansible-galaxy install commands and
instead invoke the reusable workflow (using workflow_call) passing the existing
matrix/inputs and secrets (e.g., CACHE_HOST) through; ensure the job/step names
referencing the matrix remain compatible with the reusable workflow's input
names so molecule.yml performs the collection installation and bootstrap logic
centrally.
In `@requirements-test.txt`:
- Around line 1-6: Update the temporary upper bound comment for the ansible-core
pin in requirements-test.txt (the line with "ansible-core>=2.18,<2.19") to
include a tracked removal condition — add an issue/ticket ID or a target date
and a short rationale so the pin is revisited; ensure the comment mentions the
tracker (e.g., "See ISSUE-XXXX" or "Remove by 2026-07-01 if tests pass") and
keeps the existing explanation about role_path scoping so future maintainers
know why the bound exists.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 165ab32e-cdd9-4dfb-b0f4-3d1aea6a326e
📒 Files selected for processing (9)
- .github/actions/collect-diagnostics/action.yml
- .github/workflows/molecule.yml
- .github/workflows/test_elasticsearch_upgrade.yml
- .github/workflows/test_full_stack.yml
- .github/workflows/test_linting.yml
- molecule/default/molecule.yml
- molecule/elasticsearch_diagnostics/converge.yml
- molecule/logstash_custom_pipeline/verify.yml
- requirements-test.txt
```sh
recent=$(awk -v cutoff="$since_epoch" '
  { line = tolower($0) }
  line ~ /oom-killer|out of memory|killed process/ {
    if (match($0, /^\[[^]]+\]/)) {
      ts = substr($0, RSTART+1, RLENGTH-2)
      cmd = "date -d \"" ts "\" +%s 2>/dev/null"
      if ((cmd | getline epoch) > 0 && (epoch+0) >= (cutoff+0)) print
      close(cmd)
    }
  }' "$dlog")
```
Handle plain dmesg output in the new cutoff parser.
Lines 55-57 still fall back to raw dmesg, but this awk block only accepts human-readable timestamps that date -d can parse. On hosts where dmesg -T is unavailable, every raw [12345.678] OOM line is dropped here, so kernel OOMs disappear from the report. Please either capture a parseable timestamp at collection time or record enough host-boot metadata to translate raw seconds before applying the cutoff.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/actions/collect-diagnostics/action.yml around lines 224 - 233, The
awk block that computes "recent" from "$dlog" drops raw dmesg lines like
"[12345.678]" because it only handles human-readable timestamps; update the
logic that builds "recent" to also accept raw dmesg timestamps by obtaining the
system boot epoch (e.g., btime from /proc/stat) and passing it into the awk
invocation so the parser can recognize /^\[[0-9]+\.[0-9]+\]/, extract the
seconds-since-boot, add btime to produce an epoch, compare that to "cutoff", and
still keep the existing human-readable timestamp branch; ensure you reference
the same variables used in the diff ("recent", "cutoff", "dlog") when wiring the
btime value into awk so kernel OOM lines aren’t dropped.
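A hedged sketch of that fix, keeping the existing human-readable branch and adding a boot-epoch translation for raw stamps (variable names follow the prompt; the surrounding action code may differ):

```yaml
- name: Time-filter OOM lines, including raw dmesg timestamps
  shell: bash
  run: |
    # btime in /proc/stat is the boot epoch; raw dmesg stamps are seconds
    # since boot, so btime + stamp gives a wall-clock epoch to compare.
    btime=$(awk '/^btime / {print $2}' /proc/stat)
    recent=$(awk -v cutoff="$since_epoch" -v btime="$btime" '
      { line = tolower($0) }
      line ~ /oom-killer|out of memory|killed process/ {
        if (match($0, /^\[[0-9]+\.[0-9]+\]/)) {
          # Raw dmesg: [12345.678] -> btime + seconds-since-boot.
          ts = substr($0, RSTART+1, RLENGTH-2)
          if ((btime + ts) >= (cutoff + 0)) print
        } else if (match($0, /^\[[^]]+\]/)) {
          # Human-readable dmesg -T timestamps: let date(1) parse them.
          ts = substr($0, RSTART+1, RLENGTH-2)
          cmd = "date -d \"" ts "\" +%s 2>/dev/null"
          if ((cmd | getline epoch) > 0 && (epoch+0) >= (cutoff+0)) print
          close(cmd)
        }
      }' "$dlog")
```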
```yaml
# Placeholder scenario. Molecule 26 globs molecule/default/molecule.yml on
# every run to discover shared state, and emits a CRITICAL line when it is
# absent. Every CI invocation passes -s <scenario>, so this scenario is
# never executed; the file only exists to silence that log noise.
prerun: false
driver:
  name: default
platforms:
  - name: placeholder
provisioner:
  name: ansible
```
molecule/default is incomplete and currently breaks CI coverage checks.
The run is failing because this new scenario has no verify.yml. Please add a minimal molecule/default/verify.yml (or explicitly exclude this placeholder scenario from the coverage checker).
Minimal fix (new file)
```diff
+++ b/molecule/default/verify.yml
+---
+- name: Placeholder verify
+  hosts: all
+  gather_facts: false
+  tasks:
+    - name: Placeholder assertion
+      ansible.builtin.assert:
+        that:
+          - true
```

As per coding guidelines, "molecule/**: Verify molecule scenarios follow project conventions: prefer extending existing verify.yml over creating new scenarios (each adds ~10 min CI). Check that verify assertions are meaningful and will catch regressions."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
# Placeholder scenario. Molecule 26 globs molecule/default/molecule.yml on
# every run to discover shared state, and emits a CRITICAL line when it is
# absent. Every CI invocation passes -s <scenario>, so this scenario is
# never executed; the file only exists to silence that log noise.
prerun: false
driver:
  name: default
platforms:
  - name: placeholder
provisioner:
  name: ansible
---
- name: Placeholder verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Placeholder assertion
      ansible.builtin.assert:
        that:
          - true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@molecule/default/molecule.yml` around lines 2 - 12, Add a minimal Molecule
verify playbook for the placeholder scenario so CI coverage passes: create
molecule/default/verify.yml that either includes/extends the project's shared
verify playbook or contains a tiny Ansible play that runs the standard
idempotence/connection checks (e.g., a hosts: all play with a simple assert or
ping task), ensuring it aligns with the existing convention of reusing the
shared verify logic rather than duplicating heavy flows; reference the
placeholder scenario (platform name "placeholder", prerun: false, driver name
"default") when adding the verify.yml so the coverage checker recognizes and
validates this scenario.
This branch gathers a handful of fixes that came out of investigating the scheduled runs from 13 April. The root causes are independent but the changes are small enough that bundling them felt cleaner than splitting across five PRs.
The elasticsearch_diagnostics scenario was reaching restart_and_verify_elasticsearch.yml via include_tasks with an absolute path, which stopped working in ansible-core 2.19 because role_path no longer leaks across play boundaries from a previous include_role. Switching to include_role with tasks_from restores the role context and drops the env-lookup path construction. I also pinned ansible-core below 2.19 in requirements-test.txt until the other ~50 {{ role_path }}/.. references in production role tasks are audited; the change converted the one we know about but the rest are not covered by the elasticsearch_diagnostics scenario and I want a clean run before re-opening the ceiling.
The collect-diagnostics action was crying wolf on OOMs. It pulled logs from every running container on the Incus host and scanned the shared kernel dmesg ring buffer without any time filter, so each job inherited logs from concurrent matrix siblings and weeks of historical host-level OOMs and the summary lit up red even when the failing run had no memory issue. It now filters containers by MOLECULE_RUN_SUFFIX and the OOM summary drops dmesg and ES log entries whose own timestamps predate a three-hour window. Cgroup memory.events are already per-container and stay as-is.
Dependabot PRs were bouncing off Test Linting because the ansible-galaxy install step referenced secrets.INCUS_HOST, which Dependabot does not inherit; the URL collapsed to http://:8082/... and the install DNS-failed. All four workflows that install from the cache now export CACHE_HOST and fall back to the pinned versions on upstream Galaxy when it is empty. Scheduled and manually-triggered runs keep using the internal cache.
Two cosmetic cleanups on top: a placeholder molecule/default/molecule.yml to silence the CRITICAL glob-failed line molecule 26 emits on every destroy, and a bump of the collect-diagnostics upload step to upload-artifact@v5 ahead of the Node 20 cutoff.
Finally, the logstash_custom_pipeline verify had a race where wait_for only checked for the output file's existence while Logstash was still flushing the generator events through batch; adding search_regex: processed_by makes it wait on actual content. That explains the sporadic fail on rockylinux9 + release 8 specifically, where the runner happened to be slow enough to tip it over.
The rockylinux9 + release 8 timeout on repos_default (converge 71s, idempotence cancelled at the 30m job timeout) is not addressed here. The scenario is minimal enough that the 27-minute idempotence run is almost certainly a host-level stall, but I want a second datapoint before tightening anything in the role. A workflow_dispatch of Test Role repos on this branch will either reproduce or clear it.
Summary by CodeRabbit

New Features

- Workflows conditionally install Ansible collections from the internal cache host, falling back to upstream Galaxy when no cache host is configured.

Bug Fixes

- Diagnostics collection filters containers by run suffix and time-bounds OOM detection, so reports no longer include sibling jobs' containers or historical host OOMs.
- Molecule playbooks use role-based task lookup, restoring the diagnostics scenario under ansible-core 2.19; the custom-pipeline verify waits on real output content instead of file creation.

Chores

- Ansible Core test constraint tightened; placeholder Molecule default scenario added; artifact upload action bumped to v5.