Bug
When a configuration change triggers the Restart Elasticsearch handler (e.g. adding LDAP SSL settings to elasticsearch.yml), all cluster nodes are restarted simultaneously. This causes full cluster downtime.
The handler in roles/elasticsearch/handlers/main.yml includes restart_and_verify_elasticsearch.yml, which calls the generic restart_and_verify_service.yml. This simply does systemctl restart elasticsearch on all nodes in parallel — no shard allocation management, no waiting for green status between nodes.
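For illustration, the problematic generic restart presumably boils down to a task of roughly this shape (a sketch inferred from the run log below, not the actual contents of restart_and_verify_service.yml); because handlers run on every notified host with the play's default parallelism, all nodes bounce at once:

```yaml
# Assumed shape of the generic restart task (illustrative only).
# Runs in parallel on every host that notified the handler.
- name: restart_and_verify_service | Restart elasticsearch service
  ansible.builtin.service:
    name: elasticsearch
    state: restarted
```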
Expected behavior
Config-triggered restarts should use a rolling restart pattern:
- Disable shard allocation (cluster.routing.allocation.enable: none)
- Perform a synced flush (if applicable)
- Restart one node
- Wait for the node to rejoin and cluster to reach yellow/green
- Re-enable shard allocation
- Wait for green status
- Proceed to next node
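The steps above can be sketched as an Ansible play with serial: 1, following the standard Elasticsearch rolling-restart procedure. This is a minimal sketch, not the proposed implementation: the host group name, local ES port, and retry/timeout values are assumptions.

```yaml
# Sketch of a rolling restart play. Group name, port, and timeouts are assumptions.
- hosts: elasticsearch_cluster
  serial: 1            # one node at a time
  tasks:
    - name: Disable shard allocation
      ansible.builtin.uri:
        url: "http://localhost:9200/_cluster/settings"
        method: PUT
        body_format: json
        body:
          transient:
            cluster.routing.allocation.enable: none

    - name: Perform a synced flush (best effort; deprecated since ES 7.6)
      ansible.builtin.uri:
        url: "http://localhost:9200/_flush/synced"
        method: POST
        status_code: [200, 409]   # 409 = some shards not synced; acceptable

    - name: Restart Elasticsearch on this node
      ansible.builtin.service:
        name: elasticsearch
        state: restarted

    - name: Wait for the node to rejoin and the cluster to reach at least yellow
      ansible.builtin.uri:
        url: "http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s"
      register: health
      retries: 10
      delay: 15
      until: health.status == 200

    - name: Re-enable shard allocation
      ansible.builtin.uri:
        url: "http://localhost:9200/_cluster/settings"
        method: PUT
        body_format: json
        body:
          transient:
            cluster.routing.allocation.enable: null   # reset to default (all)

    - name: Wait for green before serial moves to the next node
      ansible.builtin.uri:
        url: "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s"
```

Because serial: 1 scopes the whole task list to one host per batch, the final green check gates progression to the next node.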
Current behavior
All 9 nodes restart simultaneously:
RUNNING HANDLER [restart_and_verify_service | Restart elasticsearch service]
changed: [acc-elastic-master-001.rinis.cloud]
changed: [acc-elastic-master-002.rinis.cloud]
changed: [acc-elastic-master-003.rinis.cloud]
changed: [acc-elastic-ingest-001.rinis.cloud]
changed: [acc-elastic-ingest-002.rinis.cloud]
changed: [acc-elastic-warm-001.rinis.cloud]
changed: [acc-elastic-warm-002.rinis.cloud]
changed: [acc-elastic-hot-001.rinis.cloud]
changed: [acc-elastic-hot-002.rinis.cloud]
Note
The rolling upgrade flow (elasticsearch-rolling-upgrade.yml) already implements proper rolling restart logic with shard allocation management. The config-change handler should reuse this pattern.
Workaround
Set serial: 1 on the Elasticsearch cluster play to force sequential execution, though this doesn't include the shard allocation safety steps.
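As a stopgap, this is all the workaround amounts to (group name assumed; it only serializes the restarts, with none of the allocation-disable/flush safety steps):

```yaml
# Workaround sketch: serialize the existing play. Group name is an assumption.
- hosts: elasticsearch_cluster
  serial: 1
  roles:
    - elasticsearch
```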