Skip to content

Role-wide 9.x compatibility review and fixes#31

Merged
Oddly merged 8 commits intomainfrom
fix/rolling-upgrade-review
Mar 1, 2026
Merged

Role-wide 9.x compatibility review and fixes#31
Oddly merged 8 commits intomainfrom
fix/rolling-upgrade-review

Conversation

@Oddly
Copy link
Copy Markdown
Owner

@Oddly Oddly commented Mar 1, 2026

Comprehensive review and fixes across all roles for ES/Kibana/Logstash/Beats 9.x compatibility, handler consistency, and CI coverage.

Rolling upgrade retries were too tight — the post-upgrade health check had 5 retries (2.5 minutes) which isn't enough for cross-major-version upgrades. Bumped to 30 (15 minutes) to match the pre-upgrade wait. The single-node upgrade molecule scenario was manually installing the 9.x package and restarting, completely bypassing the role's own upgrade code path. Rewrote it to use include_role like the multi-node scenario does. Added rockylinux9 to the single-node upgrade matrix and push triggers for the upgrade workflow.

The elasticsearch.yml template was writing xpack.security.enabled: false unconditionally — ES 9.x removed the ability to disable security and rejects this setting. Now only written for release < 9. Similarly, Kibana's sniffOnStart, sniffOnConnectionFault, and sniffInterval settings were removed in 9.x — added version guards so they're only written for < 9. The Filebeat docker input type was deprecated in 9.x — the template now renders a filestream input with the container parser instead.

Logstash pipelines.yml template changes weren't triggering a service restart (missing notify). Fixed. Also made handler guards consistent across all roles: added check_mode guards to Logstash and Beats handlers, and kibana_enable guard to the Kibana handler.

Fixed dnf version detection on Rocky Linux returning duplicate version lines from dnf info, which produced a doubled version string that failed package installation.

The post-upgrade health check after re-enabling shard allocation had
only 5 retries (2.5 min), which is far too short for clusters with
significant data. Bumped to 30 retries (15 min) to match the
pre-upgrade health check's intent.

Rewrote the single-node upgrade scenario to use the elasticsearch role
for the 8.19→9.x step instead of manual package/restart. The old
approach bypassed the role entirely, leaving its single-node upgrade
code path untested by CI. The new version mirrors the multi-node
scenario's pattern: reset the shared role guard, detect the candidate
version, set elasticstack_version, and include the role.

Added rockylinux9 to the single-node upgrade matrix and a push trigger
on upgrade-related file paths so changes to the rolling upgrade flow
are caught immediately rather than waiting for the Sunday schedule.
The docker input type is deprecated in Beats 9.x. When elasticstack_release
is 9 or higher, the template now renders a filestream input with the container
parser instead, which handles both Docker JSON and CRI log formats. The old
docker input is preserved for 8.x deployments.
Guard xpack.security.enabled: false so it's only written for ES < 9 since
9.x removed the ability to disable security entirely and rejects this setting.
Guard Kibana sniff settings (sniffOnStart, sniffOnConnectionFault, sniffInterval)
for < 9 since these were removed in Kibana 9.x.

Add missing notify on the Logstash pipelines.yml template task so changes to
queue type or pipeline definitions actually trigger a service restart.

Add missing check_mode guards to Logstash and Beats handlers (consistent with
ES and Kibana handlers), and add kibana_enable guard to the Kibana handler
(consistent with ES and Logstash handlers).
dnf info can return multiple Version lines (e.g., for x86_64 and noarch
variants). The shell command produced a doubled version string like
"9.3.1\n9.3.1" which then failed package installation. Adding head -1
ensures only the first version line is used.
@Oddly Oddly changed the title Fix rolling upgrade retries and CI coverage gaps Role-wide 9.x compatibility review and fixes Mar 1, 2026
Oddly added 2 commits March 1, 2026 12:09
api.http.host/api.http.port have been available since Logstash 8.0, so there's
no need to branch between http.* and api.http.* — just use api.http.*
unconditionally.

The filestream input type has been GA since Filebeat 7.14 and is the
recommended replacement for both the log and docker input types, which were
deprecated in 8.x and removed in 9.x. Using filestream unconditionally removes
the version branches and halves the template size for log inputs.

The remaining version conditionals (log4j2 ECSJsonLayout, Logstash plugin SSL
renames, xpack.security.enabled guard, Kibana sniff settings, monitoring
allow_legacy_collection) are genuinely 9.x-specific and need to stay.
Restricts the GITHUB_TOKEN to read-only content access across all 14 workflow
files. Resolves 20 code scanning alerts for missing-workflow-permissions.
PR and merge_group triggers now test only the latest distro per OS family
(rockylinux10/debian13, or rockylinux9/debian13 for full-stack which
excludes rocky10 due to missing EPEL Redis). Schedule and workflow_dispatch
triggers keep the full 7-distro matrix. Added daily schedule cron to all
role-specific workflows so the full matrix still runs every night.

Also unified version selection across ES, Kibana, and Logstash to handle
elasticstack_version: latest the same way beats already did — the construct
tasks skip the version suffix and install tasks pass state: latest. The
rolling upgrade guards now skip when version is 'latest' since version
comparison would fail on a non-semver string.
@Oddly Oddly merged commit 5642f3d into main Mar 1, 2026
38 checks passed
@Oddly Oddly deleted the fix/rolling-upgrade-review branch March 1, 2026 12:39
Oddly added a commit that referenced this pull request Mar 1, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 1, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 3, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Oddly added a commit that referenced this pull request Mar 4, 2026
Added security-events: write to kics.yml permissions so the SARIF
upload step can actually push results to GitHub code scanning. This
broke silently after #31 added permissions: contents: read.

Moved the upgrade test from weekly-on-Sunday plus push-to-main to a
daily schedule. Running on push was too late to catch regressions since
PRs never trigger these tests, and weekly was too infrequent.

Removed tests/unit/plugins/module_utils/test_api.py which imports
plugins/module_utils/api.py — a module that doesn't exist in the repo.
The test was never run by CI anyway.

Fixed test_cert_info.py to skip date field assertions since the fixture
P12 cert has a hardcoded expiry of 2026-03-28 that would cause the
test to fail once the cert expires.

Updated plugins/modules/README.md tested versions to match current CI
reality (cryptography 42-46, Python 3.11-3.12, ansible-core 2.18-2.20).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant