fix(elasticsearch): roll config restarts #137

Merged
Oddly merged 3 commits into main from fix/elasticsearch-rolling-config-restart on Apr 20, 2026
Conversation

Oddly (Owner) commented Apr 16, 2026

Fixes #121. Adds health-gated rolling restarts for Elasticsearch config changes plus contract and Molecule coverage.

Summary by CodeRabbit

  • New Features

    • Health-gated, coordinated rolling restart for Elasticsearch config changes.
    • Persistent cluster-settings apply with idempotency and security-aware credential checks.
    • Upgrade-detection to determine when rolling upgrades are required and enforce safe upgrade paths.
  • Tests

    • New integration contract tests and a fake Elasticsearch API to validate rolling restart, cluster settings, and upgrade detection.
  • Chores

    • Updated ignore rules and test provisioning environment variables.

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Adds a strategy-based restart flow for config-triggered Elasticsearch restarts: new role defaults and handlers, extracted tasks for upgrade-detection and cluster-settings, per-node rolling restart tasks with allocation/flush/health gating, a fake ES HTTP server, Molecule side-effect play, and multiple integration contract tests and inventories.

Changes

Cohort / File(s) — Summary

• Repo config & Molecule (.gitignore, molecule/elasticsearch_default/molecule.yml): ignore tests/integration/ except selected fixture files; add ANSIBLE_COLLECTIONS_PATH to the Molecule provisioner env.
• Role defaults (roles/elasticsearch/defaults/main.yml): add new variables controlling restart strategy (rolling/direct), flush, wait-status, and retry/delay tuning for health and node-rejoin checks.
• Handlers (roles/elasticsearch/handlers/main.yml): replace the single restart handler with conditional handlers that set/clear the restart request and dispatch either the direct or the rolling restart flow (gated by check mode, freshstart flags, strategy, and group size).
• Upgrade detection, extracted (roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml, roles/elasticsearch/tasks/main.yml): move rolling-upgrade detection and upgrade-path validation into a separate task file included from main.yml.
• Cluster settings, extracted (roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml, roles/elasticsearch/tasks/main.yml): extract persistent cluster-settings computation, GET/conditional PUT, and auth/TLS/no_log handling into a dedicated task file included from main.yml.
• Rolling restart orchestration (roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml, roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml): add an orchestrator that computes the restart hostlist, validates inputs/credentials, and iterates hosts; the per-node workflow disables/enables allocation, optionally flushes, restarts the service, verifies rejoin via _cat/nodes, waits for health, records timestamps, and includes rescue/always cleanup with diagnostics.
• Restart helper, direct path referenced (roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml, roles/elasticsearch/handlers/...): introduce node-level restart verification with journal collection on failure and cleanup delegation to surviving peers.
• Molecule side-effect test (molecule/elasticsearch_default/side_effect.yml): add a Molecule playbook that drives a config-change restart scenario, records PIDs before/after, and asserts per-host restart and rolling ordering based on timestamps.
• Fake ES HTTP server, tests (tests/fakes/fake_es_rolling_api.py): add a threaded fake Elasticsearch-like HTTP server supporting _cluster/settings, _cluster/health, _cat/nodes, /_flush, persistent-settings mutation, request logging, and per-port failure injection.
• Integration tests & inventories (tests/integration/rolling_restart_contract.yml, tests/integration/rolling_restart_inventory.ini, tests/integration/elasticsearch_cluster_settings_contract.yml, tests/integration/elasticsearch_template_contract.yml, tests/integration/elasticsearch_template_inventory.ini, tests/integration/elasticsearch_upgrade_detection_contract.yml): add contract playbooks and inventories validating cluster-settings idempotency/security gating, template rendering, upgrade-detection scenarios, and rolling-restart behaviors (flush on/off, subset restarts, credential failures, simulated rejoin failure and cleanup).
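The review comments below confirm two of the new variable names (elasticsearch_config_restart_strategy and elasticsearch_config_restart_wait_status); a hedged sketch of what the new defaults block could look like, with the remaining names and values as illustrative assumptions:

    # roles/elasticsearch/defaults/main.yml (illustrative excerpt)
    elasticsearch_config_restart_strategy: rolling   # "rolling" or "direct"
    elasticsearch_config_restart_wait_status: green  # health to reach between nodes
    elasticsearch_config_restart_flush: true         # POST /_flush before each restart (name assumed)
    elasticsearch_config_restart_health_retries: 30  # polls of _cluster/health (name assumed)
    elasticsearch_config_restart_health_delay: 10    # seconds between health polls (name assumed)
    elasticsearch_config_restart_rejoin_retries: 30  # polls of _cat/nodes for rejoin (name assumed)
    elasticsearch_config_restart_rejoin_delay: 5     # seconds between rejoin polls (name assumed)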

Sequence Diagram(s)

sequenceDiagram
    participant Handler as Restart Handler
    participant Orchestrator as Rolling Orchestrator
    participant NodeA as ES Node A
    participant NodeB as ES Node B
    participant Cluster as Cluster API / Master
    participant FakeAPI as Fake ES API

    Handler->>Orchestrator: Trigger restart (strategy=rolling)
    Orchestrator->>Orchestrator: validate strategy & credentials
    Orchestrator->>Cluster: PUT _cluster/settings (disable allocation)
    Cluster-->>Orchestrator: acknowledged
    Orchestrator->>NodeA: (optional) POST /_flush
    NodeA-->>Orchestrator: flush response
    Orchestrator->>NodeA: systemctl restart elasticsearch
    NodeA-->>Orchestrator: service restarted
    Orchestrator->>FakeAPI: poll /_cat/nodes until NodeA present
    FakeAPI-->>Orchestrator: node present
    Orchestrator->>Cluster: PUT _cluster/settings (enable allocation)
    Cluster-->>Orchestrator: acknowledged
    Orchestrator->>Cluster: wait for health (yellow/green)
    Cluster-->>Orchestrator: health ok
    Note right of Orchestrator: repeat for NodeB
    Orchestrator-->>Handler: Rolling restart complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
• Docstring Coverage ⚠️ Warning — docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4)
• Description Check ✅ — check skipped; CodeRabbit's high-level summary is enabled.
• Title Check ✅ — the PR title clearly summarizes the main change: implementing rolling (sequential) restarts for Elasticsearch configuration changes rather than parallel restarts.
• Linked Issues Check ✅ — the PR fully implements the requirements from issue #121: rolling restart logic with shard allocation management, health-gated restarts, and validation that nodes rejoin before proceeding to the next node.
• Out of Scope Changes Check ✅ — all changes are directly aligned with issue #121 requirements: rolling restart handler modifications, supporting task files, test infrastructure, and configuration defaults for restart behavior control.



coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (5)
roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml (1)

34-39: Redundant when condition.

The when: _elasticsearch_restart_hosts | length > 0 check on line 39 is unnecessary—looping over an empty list naturally results in zero iterations. Removing it simplifies the code.

Optional simplification
 - name: restart_and_verify_elasticsearch_rolling | Restart requested nodes one at a time
   ansible.builtin.include_tasks: restart_and_verify_elasticsearch_rolling_node.yml
   loop: "{{ _elasticsearch_restart_hosts }}"
   loop_control:
     loop_var: _elasticsearch_restart_host
-  when: _elasticsearch_restart_hosts | length > 0
.gitignore (1)

10-18: Redundant gitignore rule on line 10.

Line 10 (tests/integration/) ignores the directory itself, but line 11 immediately negates it with !tests/integration/. Since line 12 (tests/integration/*) already ignores the directory contents, line 10 is redundant and creates confusion.

Consider removing line 10 if you only want to ignore contents while tracking specific files:

Proposed simplification
-tests/integration/
-!tests/integration/
 tests/integration/*
 !tests/integration/elasticsearch_cluster_settings_contract.yml
 !tests/integration/elasticsearch_template_contract.yml
tests/fakes/fake_es_rolling_api.py (1)

113-128: Consider adding graceful shutdown capability.

The server blocks indefinitely with threading.Event().wait() which works for the test scenario (killed via PID), but doesn't support graceful shutdown. This is acceptable for test infrastructure, but a signal handler could make debugging easier.

molecule/elasticsearch_default/side_effect.yml (1)

23-26: Add error handling for PID capture.

If Elasticsearch isn't running when this task executes, pgrep returns non-zero and the task fails. Consider adding failed_when: false or checking the return code, especially since this runs at the start of the side effect test.

Proposed fix
     - name: Record ES PID before config change
       ansible.builtin.command: pgrep -f 'org.elasticsearch.bootstrap.Elasticsearch'
       register: _es_pid_before_config_restart
       changed_when: false
+      failed_when: false
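If failed_when: false is adopted, a downstream guard keeps a genuinely missing process explicit rather than silently tolerated; a minimal sketch, reusing the registered variable from the task above:

    - name: Fail clearly when no ES PID was captured
      ansible.builtin.assert:
        that:
          - _es_pid_before_config_restart.rc == 0
          - _es_pid_before_config_restart.stdout | trim | length > 0
        fail_msg: "Elasticsearch does not appear to be running before the config change"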
tests/integration/rolling_restart_contract.yml (1)

88-94: Assertion patterns may not match JSONL key order.

The assertions at lines 91-92 search for '"path": "/_flush", "port": 19200'. However, looking at the fake API's log() method (context snippet 2), keys are sorted alphabetically (sort_keys=True), so the actual order is: body, method, path, port. The pattern should work since both "path" and "port" appear consecutively in sorted order.

Lines 93-94 assert "primaries" and "null" appear in the log—these likely come from the allocation settings PUT body. Consider adding comments explaining what these assertions verify.

Clarify assertion intent
     - name: Assert both fake nodes were processed with flush
       ansible.builtin.assert:
         that:
           - '_fake_api_log.stdout is search(''"path": "/_flush", "port": 19200'')'
           - '_fake_api_log.stdout is search(''"path": "/_flush", "port": 19201'')'
+          # Verify allocation settings were applied (primaries allocation, null to reset)
           - "_fake_api_log.stdout is search('primaries')"
           - "_fake_api_log.stdout is search('null')"
⚠️ Actionable inline comments (6)

In molecule/elasticsearch_default/molecule.yml:
• Line 24: ANSIBLE_COLLECTIONS_PATH hardcodes a user-specific path (/home/sam/.ansible/collections). Use a portable value such as "${MOLECULE_PROJECT_DIRECTORY}/.ansible/collections:${HOME}/.ansible/collections:/usr/share/ansible/collections", or an env-default pattern like "${ANSIBLE_COLLECTIONS_PATH:-${MOLECULE_PROJECT_DIRECTORY}/.ansible/collections:${HOME}/.ansible/collections:/usr/share/ansible/collections}", so contributors' home directories resolve dynamically.

In roles/elasticsearch/handlers/main.yml:
• Lines 3-12 (also 23-24, 37-38, 50-51): the when conditions reference registered variables that may be undefined (_elasticsearch_freshstart, _elasticsearch_freshstart_security, and similar), so accessing .changed can fail. Guard each access with a safe default or defined check, e.g. replace "not _elasticsearch_freshstart.changed | bool" with "not (_elasticsearch_freshstart.changed | default(false) | bool)".

In roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml:
• Lines 3-8: the set_fact that builds _elasticsearch_effective_cluster_settings uses elasticsearch_logsdb without a default, which can raise an undefined-variable error. Guard it with elasticsearch_logsdb | default(false) | bool so the ternary always sees a boolean, keeping the existing combine(elasticsearch_cluster_settings | default({})) logic intact.

In roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml:
• Lines 14-23: the set_fact for _elasticsearch_needs_rolling_upgrade compares elasticstack_release | int without a defined check. Add "elasticstack_release is defined and" in front of the comparison so the OR clause only evaluates it when the variable exists.

In roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml:
• Line 16: the conditional for _elasticsearch_health_statuses uses elasticsearch_config_restart_wait_status and may allow 'yellow' for the post-restart gate. After re-enabling shard allocation, the expected health should be strictly ['green'], regardless of elasticsearch_config_restart_wait_status.
• Lines 225-261: the "Re-enable shard allocation" and "Wait for cluster health after restart" tasks delegate to _elasticsearch_restart_host, the very node being restarted. Delegate both to a stable API host that excludes the restart target, e.g. set _elasticsearch_cluster_api_host to (groups['elasticsearch'] | difference([_elasticsearch_restart_host]) | first) and use delegate_to: "{{ _elasticsearch_cluster_api_host | default(_elasticsearch_restart_host) }}" so cleanup and health checks run via a known-live node; see the sketch after this list.
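A minimal sketch of that surviving-peer delegation, assuming the inventory group is literally named elasticsearch and plain HTTP on port 9200 (protocol, TLS, and credential handling omitted):

    - name: Pick a surviving peer for cluster-wide API calls
      ansible.builtin.set_fact:
        # Any group member other than the restart target; Jinja's first filter
        # yields undefined on an empty sequence, so default() still applies.
        _elasticsearch_cluster_api_host: >-
          {{ groups['elasticsearch'] | default([])
             | difference([_elasticsearch_restart_host])
             | first | default(_elasticsearch_restart_host) }}

    - name: Re-enable shard allocation via the surviving peer
      ansible.builtin.uri:
        url: "http://{{ _elasticsearch_cluster_api_host }}:9200/_cluster/settings"
        method: PUT
        body_format: json
        body:
          persistent:
            cluster.routing.allocation.enable: null  # null resets to the default
      delegate_to: "{{ _elasticsearch_cluster_api_host }}"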

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bc10b5c3-0270-4780-8150-92c82c54e803

📥 Commits

Reviewing files that changed from the base of the PR and between e236b4f and ad22296.

📒 Files selected for processing (17)
  • .gitignore
  • molecule/elasticsearch_default/molecule.yml
  • molecule/elasticsearch_default/side_effect.yml
  • roles/elasticsearch/defaults/main.yml
  • roles/elasticsearch/handlers/main.yml
  • roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml
  • roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
  • roles/elasticsearch/tasks/main.yml
  • roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml
  • roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml
  • tests/fakes/fake_es_rolling_api.py
  • tests/integration/elasticsearch_cluster_settings_contract.yml
  • tests/integration/elasticsearch_template_contract.yml
  • tests/integration/elasticsearch_template_inventory.ini
  • tests/integration/elasticsearch_upgrade_detection_contract.yml
  • tests/integration/rolling_restart_contract.yml
  • tests/integration/rolling_restart_inventory.ini

Comment on roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml, line 16:

    _elasticsearch_restart_port: "{{ hostvars[_elasticsearch_restart_host].elasticstack_elasticsearch_http_port | default(elasticstack_elasticsearch_http_port) }}"
    _elasticsearch_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
⚠️ Potential issue | 🟠 Major

Require green after re-enabling allocation.

_elasticsearch_health_statuses is reused for the post-restart gate, so with the default elasticsearch_config_restart_wait_status: yellow this can advance to the next node while the cluster is still yellow. Issue #121 explicitly requires waiting for green before moving on; otherwise the next restart can start with replicas still unassigned.

Suggested change
-    _elasticsearch_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+    _elasticsearch_pre_restart_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+    _elasticsearch_post_restart_health_statuses: ['green']
@@
-        - (_elasticsearch_pre_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+        - (_elasticsearch_pre_restart_health.json.status | default('')) in _elasticsearch_pre_restart_health_statuses
@@
-        - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+        - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_post_restart_health_statuses
@@
-        - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+        - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_post_restart_health_statuses

Also applies to: 166-183, 243-261
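With the split variables in place, the post-reenable gate might look like the following sketch, assuming plain HTTP on port 9200 and illustrative retry/delay values:

    - name: Wait for green after re-enabling allocation
      ansible.builtin.uri:
        url: "http://{{ _elasticsearch_restart_host }}:9200/_cluster/health"
        return_content: true
      register: _elasticsearch_post_restart_health
      # The post-restart list is unconditionally ['green'] per the suggested change
      until: >-
        (_elasticsearch_post_restart_health.json.status | default(''))
        in _elasticsearch_post_restart_health_statuses
      retries: 30   # illustrative; the role's retry/delay defaults would apply
      delay: 10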


Oddly added and then removed the ci:run (Trigger gated pull request CI) label Apr 19, 2026
coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml (1)

16-16: ⚠️ Potential issue | 🟠 Major

Require green after allocation is re-enabled.

_elasticsearch_health_statuses is still reused for the post-restart gates, so a run with elasticsearch_config_restart_wait_status: yellow can proceed while replicas are still unassigned. That breaks the rolling-restart contract from issue #121 for the post-reenable phase.

Suggested fix: as in the previous review, split into pre- and post-restart status lists and use the post-restart list (['green']) for both post-reenable gates.

Also applies to: 181-193, 258-270

⚠️ Actionable inline comments (2)

In roles/elasticsearch/handlers/main.yml:
• Lines 20-22 (also 34-36, 48-49): the conditionals index groups[elasticstack_elasticsearch_group_name] without a local default, which can fail when the role is used standalone. Apply the same safe fallback used elsewhere, e.g. groups[elasticstack_elasticsearch_group_name] | default([]) | length <= 1, so the direct/rolling decision won't error when the group variable is missing.

In roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml:
• Lines 35-39: the _needs_update idempotency check compares value | string to _current[key] | string, while Elasticsearch may return lowercase "true"/"false" for YAML booleans (True/False), causing repeated PUTs. Normalize both sides before comparing; see the expanded comment below.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 451fc45c-53e9-4c53-8b5f-c35982551990

📥 Commits

Reviewing files that changed from the base of the PR and between ad22296 and 25f7be9.

📒 Files selected for processing (8)
  • molecule/elasticsearch_default/molecule.yml
  • roles/elasticsearch/defaults/main.yml
  • roles/elasticsearch/handlers/main.yml
  • roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml
  • roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
  • roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml
  • tests/fakes/fake_es_rolling_api.py
  • tests/integration/rolling_restart_contract.yml
✅ Files skipped from review due to trivial changes (1)
  • tests/fakes/fake_es_rolling_api.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • molecule/elasticsearch_default/molecule.yml
  • roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
  • roles/elasticsearch/defaults/main.yml

Comment on roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml, lines +35 to +39:

        _needs_update: >-
          {% set ns = namespace(changed=false) %}
          {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
          {%   if _current.get(key) is none or _current[key] | string != value | string %}
          {%     set ns.changed = true %}

⚠️ Potential issue | 🟠 Major

Boolean setting values can break idempotent diff detection.

On Line 38, string comparison can mis-detect changes when desired values are YAML booleans (True/False) but Elasticsearch returns lowercase strings (true/false), causing repeated PUTs.

Proposed fix
-        _needs_update: >-
-          {% set ns = namespace(changed=false) %}
-          {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
-          {%   if _current.get(key) is none or _current[key] | string != value | string %}
-          {%     set ns.changed = true %}
-          {%   endif %}
-          {% endfor %}
-          {{ ns.changed }}
+        _needs_update: >-
+          {% set ns = namespace(changed=false) %}
+          {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
+          {%   if value is boolean %}
+          {%     set expected = 'true' if value else 'false' %}
+          {%   else %}
+          {%     set expected = value | string %}
+          {%   endif %}
+          {%   if _current.get(key) is none or (_current[key] | string) != expected %}
+          {%     set ns.changed = true %}
+          {%   endif %}
+          {% endfor %}
+          {{ ns.changed }}

As per coding guidelines, "Review for Ansible best practices: idempotency (command/shell needs creates/removes guards), correct handler notifications, proper use of become/become_user, and platform conditionals (Debian vs RHEL)."
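The agent prompt for this comment suggests a more compact normalization via | string | lower on both sides; a sketch follows. The trade-off (an assumption worth checking against the managed settings) is that lowercasing also touches non-boolean string values, so this variant only fits if no case-sensitive setting values are in play:

        _needs_update: >-
          {% set ns = namespace(changed=false) %}
          {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
          {%   if _current.get(key) is none
                 or (_current[key] | string | lower) != (value | string | lower) %}
          {%     set ns.changed = true %}
          {%   endif %}
          {% endfor %}
          {{ ns.changed }}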


Oddly added the ci:run (Trigger gated pull request CI) label Apr 19, 2026
Oddly added 2 commits on April 19, 2026 at 03:36:

We drop a local /home/sam path that snuck into the molecule env, guard the
handler and task expressions that dereference sentinel facts so standalone
role use won't trip an undefined-variable error, and default the post-restart
wait to green to match the issue spec and upstream rolling-upgrade docs.

The cleanup in the always: block previously delegated back to the node
we had just tried to restart, so if that node was the one failing we'd
lose our chance to re-enable shard allocation on the rest of the cluster.
It now picks a surviving peer from the elasticsearch group. The contract
test grew a new scenario that forces a rejoin failure on es1 and asserts
the cleanup PUT lands on es2.
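A hedged sketch of what that new contract assertion might look like; es1/es2 come from the commit message, while the port number and log-variable name are assumptions carried over from the earlier contract test (the fake API logs keys alphabetically: body, method, path, port):

    - name: Assert the cleanup PUT landed on the surviving peer (es2)
      ansible.builtin.assert:
        that:
          # es1's simulated rejoin failure must not prevent re-enabling
          # allocation via es2, whose fake API listens on 19201 in this sketch
          - _fake_api_log.stdout is search('"method": "PUT", "path": "/_cluster/settings", "port": 19201')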
Oddly force-pushed the fix/elasticsearch-rolling-config-restart branch from 25f7be9 to 3ab5c67 on April 19, 2026 at 01:37
Oddly added and then removed the ci:run (Trigger gated pull request CI) label Apr 19, 2026
github-actions Bot removed the ci:run (Trigger gated pull request CI) label Apr 19, 2026
coderabbitai Bot left a comment

♻️ Duplicate comments (3)
roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml (1)

35-39: ⚠️ Potential issue | 🟠 Major

Normalize booleans before comparison to preserve idempotency.

Line 38 still compares raw string forms; YAML booleans (True/False) can differ from Elasticsearch-returned lowercase values (true/false), causing repeated PUTs.

Suggested fix: as proposed in the previous review, normalize booleans to their lowercase string form before comparing.

roles/elasticsearch/handlers/main.yml (1)

20-22: ⚠️ Potential issue | 🟠 Major

Default the group-name variable before indexing groups[...].

These conditions still index groups[elasticstack_elasticsearch_group_name] directly. If this role is used standalone and that variable is unset, the expression can fail before the trailing | default([]) is applied, so the direct/rolling split crashes instead of falling back safely.

Suggested hardening
-      or groups[elasticstack_elasticsearch_group_name] | default([]) | length <= 1
+      or groups[elasticstack_elasticsearch_group_name | default('elasticsearch')] | default([]) | length <= 1
...
-      or groups[elasticstack_elasticsearch_group_name] | default([]) | length <= 1
+      or groups[elasticstack_elasticsearch_group_name | default('elasticsearch')] | default([]) | length <= 1
...
-    - groups[elasticstack_elasticsearch_group_name] | default([]) | length > 1
+    - groups[elasticstack_elasticsearch_group_name | default('elasticsearch')] | default([]) | length > 1

Also applies to: 34-36, 48-49

roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml (1)

16-16: ⚠️ Potential issue | 🟠 Major

Require green after re-enabling allocation.

This status list is reused by the post-restart health checks, so elasticsearch_config_restart_wait_status: yellow still allows the workflow to advance while the cluster is yellow. Issue #121 requires the post-allocation gate to wait for green before moving to the next node.

Suggested fix: split the status list as in the previous reviews.

Then use _elasticsearch_pre_restart_health_statuses for the pre-restart wait, and _elasticsearch_post_restart_health_statuses for the post-reenable waits.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8aaad5e7-0a56-4d9c-9cbf-f8f6cf343b19

📥 Commits

Reviewing files that changed from the base of the PR and between 25f7be9 and 3ab5c67.

📒 Files selected for processing (17)
  • .gitignore
  • molecule/elasticsearch_default/molecule.yml
  • molecule/elasticsearch_default/side_effect.yml
  • roles/elasticsearch/defaults/main.yml
  • roles/elasticsearch/handlers/main.yml
  • roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml
  • roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
  • roles/elasticsearch/tasks/main.yml
  • roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml
  • roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml
  • tests/fakes/fake_es_rolling_api.py
  • tests/integration/elasticsearch_cluster_settings_contract.yml
  • tests/integration/elasticsearch_template_contract.yml
  • tests/integration/elasticsearch_template_inventory.ini
  • tests/integration/elasticsearch_upgrade_detection_contract.yml
  • tests/integration/rolling_restart_contract.yml
  • tests/integration/rolling_restart_inventory.ini
✅ Files skipped from review due to trivial changes (8)
  • .gitignore
  • tests/integration/rolling_restart_inventory.ini
  • tests/integration/elasticsearch_template_inventory.ini
  • roles/elasticsearch/defaults/main.yml
  • tests/integration/elasticsearch_upgrade_detection_contract.yml
  • tests/integration/elasticsearch_template_contract.yml
  • tests/fakes/fake_es_rolling_api.py
  • molecule/elasticsearch_default/side_effect.yml
🚧 Files skipped from review as they are similar to previous changes (4)
  • molecule/elasticsearch_default/molecule.yml
  • roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml
  • roles/elasticsearch/tasks/main.yml
  • roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml

…olecule

The provisioner env overrode ANSIBLE_COLLECTIONS_PATH with a path that
doesn't include the runner's temp collections dir, so ansible couldn't
resolve the oddly.elasticstack.repos role under self-hosted CI. Every
other scenario inherits the env correctly; this one had a leftover from
local dev. Dropping it restores the CI-provided path.
Oddly added the ci:run (Trigger gated pull request CI) label Apr 20, 2026
github-actions Bot removed the ci:run (Trigger gated pull request CI) label Apr 20, 2026
Oddly merged commit 0e109da into main on Apr 20, 2026; 116 of 198 checks passed.
Oddly deleted the fix/elasticsearch-rolling-config-restart branch on April 20, 2026 at 07:52.

Development

Successfully merging this pull request may close these issues:

• Restart handler restarts all nodes in parallel instead of rolling