Role-wide 9.x compatibility review and fixes#31
Merged
Conversation
The post-upgrade health check after re-enabling shard allocation had only 5 retries (2.5 min), which is far too short for clusters with significant data. Bumped to 30 retries (15 min) to match the pre-upgrade health check's intent. Rewrote the single-node upgrade scenario to use the elasticsearch role for the 8.19→9.x step instead of manual package/restart. The old approach bypassed the role entirely, leaving its single-node upgrade code path untested by CI. The new version mirrors the multi-node scenario's pattern: reset the shared role guard, detect the candidate version, set elasticstack_version, and include the role. Added rockylinux9 to the single-node upgrade matrix and a push trigger on upgrade-related file paths so changes to the rolling upgrade flow are caught immediately rather than waiting for the Sunday schedule.
The docker input type is deprecated in Beats 9.x. When elasticstack_release is 9 or higher, the template now renders a filestream input with the container parser instead, which handles both Docker JSON and CRI log formats. The old docker input is preserved for 8.x deployments.
Guard xpack.security.enabled: false so it's only written for ES < 9 since 9.x removed the ability to disable security entirely and rejects this setting. Guard Kibana sniff settings (sniffOnStart, sniffOnConnectionFault, sniffInterval) for < 9 since these were removed in Kibana 9.x. Add missing notify on the Logstash pipelines.yml template task so changes to queue type or pipeline definitions actually trigger a service restart. Add missing check_mode guards to Logstash and Beats handlers (consistent with ES and Kibana handlers), and add kibana_enable guard to the Kibana handler (consistent with ES and Logstash handlers).
dnf info can return multiple Version lines (e.g., for x86_64 and noarch variants). The shell command produced a doubled version string like "9.3.1\n9.3.1" which then failed package installation. Adding head -1 ensures only the first version line is used.
api.http.host/api.http.port have been available since Logstash 8.0, so there's no need to branch between http.* and api.http.* — just use api.http.* unconditionally. The filestream input type has been GA since Filebeat 7.14 and is the recommended replacement for both the log and docker input types, which were deprecated in 8.x and removed in 9.x. Using filestream unconditionally removes the version branches and halves the template size for log inputs. The remaining version conditionals (log4j2 ECSJsonLayout, Logstash plugin SSL renames, xpack.security.enabled guard, Kibana sniff settings, monitoring allow_legacy_collection) are genuinely 9.x-specific and need to stay.
Restricts the GITHUB_TOKEN to read-only content access across all 14 workflow files. Resolves 20 code scanning alerts for missing-workflow-permissions.
This was referenced Mar 1, 2026
PR and merge_group triggers now test only the latest distro per OS family (rockylinux10/debian13, or rockylinux9/debian13 for full-stack which excludes rocky10 due to missing EPEL Redis). Schedule and workflow_dispatch triggers keep the full 7-distro matrix. Added daily schedule cron to all role-specific workflows so the full matrix still runs every night. Also unified version selection across ES, Kibana, and Logstash to handle elasticstack_version: latest the same way beats already did — the construct tasks skip the version suffix and install tasks pass state: latest. The rolling upgrade guards now skip when version is 'latest' since version comparison would fail on a non-semver string.
Oddly
added a commit
that referenced
this pull request
Mar 1, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 1, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly
added a commit
that referenced
this pull request
Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF upload step can actually push results to GitHub code scanning. This broke silently after #31 added permissions: contents: read. Moved the upgrade test from weekly-on-Sunday plus push-to-main to a daily schedule. Running on push was too late to catch regressions since PRs never trigger these tests, and weekly was too infrequent. Removed tests/unit/plugins/module_utils/test_api.py which imports plugins/module_utils/api.py — a module that doesn't exist in the repo. The test was never run by CI anyway. Fixed test_cert_info.py to skip date field assertions since the fixture P12 cert has a hardcoded expiry of 2026-03-28 that would cause the test to fail once the cert expires. Updated plugins/modules/README.md tested versions to match current CI reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Comprehensive review and fixes across all roles for ES/Kibana/Logstash/Beats 9.x compatibility, handler consistency, and CI coverage.
Rolling upgrade retries were too tight — the post-upgrade health check had 5 retries (2.5 minutes) which isn't enough for cross-major-version upgrades. Bumped to 30 (15 minutes) to match the pre-upgrade wait. The single-node upgrade molecule scenario was manually installing the 9.x package and restarting, completely bypassing the role's own upgrade code path. Rewrote it to use include_role like the multi-node scenario does. Added rockylinux9 to the single-node upgrade matrix and push triggers for the upgrade workflow.
The elasticsearch.yml template was writing xpack.security.enabled: false unconditionally — ES 9.x removed the ability to disable security and rejects this setting. Now only written for release < 9. Similarly, Kibana's sniffOnStart, sniffOnConnectionFault, and sniffInterval settings were removed in 9.x — added version guards so they're only written for < 9. The Filebeat docker input type was deprecated in 9.x — the template now renders a filestream input with the container parser instead.
Logstash pipelines.yml template changes weren't triggering a service restart (missing notify). Fixed. Also made handler guards consistent across all roles: added check_mode guards to Logstash and Beats handlers, and kibana_enable guard to the Kibana handler.
Fixed dnf version detection on Rocky Linux returning duplicate version lines from dnf info, which produced a doubled version string that failed package installation.