
Improving Reliability of statefulset RollingUpdate with Container Lifecycle Hooks #923

Open
wkd-woo opened this issue May 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments


wkd-woo commented May 13, 2024

Is your feature request related to a problem? Please describe.
In scenarios that change the desired state of a StatefulSet, such as a Redis version upgrade, the StatefulSet's updateStrategy causes its Pods to undergo a RollingUpdate.

Assuming a 3-member replication setup, there is a risk of data loss if a Pod goes down momentarily before a synced replica is secured, because the operator does not reconcile during the RollingUpdate.

Therefore, during the RollingUpdate driven by the StatefulSet, it is crucial to ensure that at least one replica synchronized with the leader is secured.

One might think that setting the StatefulSet's terminationGracePeriodSeconds to a sufficiently long duration to delay the RollingUpdate would be adequate, but I believe using Container Lifecycle Hooks to functionally guarantee this would significantly enhance the project's reliability.

Describe the solution you'd like
Describe alternatives you've considered
I propose writing event code for the PreStop hook that checks whether a failover-capable replica is secured before the container terminates:

- If the Pod designated for deletion has the redis-role slave, it is safe to delete the Pod.
- If it is the master, wait until a fully synced replica is secured (`master_sync_in_progress == 0`):
  - If one is already secured, proceed with termination.
  - If a sync is in progress, stay in the loop until it completes.

```
127.0.0.1:6379> INFO REPLICATION
# Replication
role:slave
master_host:xxx.xxx.xxx.xxx
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
...
```
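The `INFO REPLICATION` output shown above is a simple `key:value` text format, so the PreStop hook only needs a small parser. A minimal sketch in Python (the helper name and parsing approach are my own illustration, not existing operator code):

```python
def parse_info_replication(raw: str) -> dict:
    """Parse `redis-cli INFO REPLICATION` output into a field -> value dict.

    Skips blank lines and section headers (lines starting with '#').
    """
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value.strip()
    return fields


sample = """# Replication
role:slave
master_host:10.0.0.1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
"""
info = parse_info_replication(sample)
```

With this, the hook can read `info["role"]`, `info["master_link_status"]`, and `info["master_sync_in_progress"]` directly when deciding whether termination is safe.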

I would like to hear what the maintainers think about this issue and the development of this feature.

If it's difficult for you to allocate time, I would like to add this feature myself and submit a Pull Request.

What version of redis-operator are you using?

redis-operator version:

Additional context
Here's the pseudo-code of the PreStop event code.

### Pseudo-Code

```
# re-read INFO REPLICATION on each loop iteration in a real implementation
infoReplication := redis-cli INFO REPLICATION

role                 := infoReplication[role]
masterSyncInProgress := infoReplication[master_sync_in_progress]
connectedSlaves      := infoReplication[connected_slaves]
masterLinkStatus     := infoReplication[master_link_status]

if role == "master":
    while not (connectedSlaves > 0 and masterSyncInProgress == 0):
        sleep(1)
    exit(0)
else if role == "slave":
    while masterLinkStatus != "up":
        sleep(1)
    exit(0)
```
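As a concrete illustration of that pseudo-code, here is a minimal Python sketch of the wait loop. The function name, the injectable `fetch_info` callback, and the `max_wait` timeout are assumptions of mine, not existing operator code; a real PreStop hook would re-run `redis-cli INFO REPLICATION` on each iteration:

```python
import time


def wait_until_safe_to_stop(fetch_info, poll_seconds=1.0, max_wait=300.0):
    """Block until this Redis node can terminate without risking data loss.

    fetch_info() must return the INFO REPLICATION fields as a dict of
    strings (hypothetical callback; in practice it would shell out to
    `redis-cli INFO REPLICATION` and parse the output).

    Returns True when safe, False if max_wait elapses first.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        info = fetch_info()
        if info.get("role") == "master":
            # Master: require at least one replica that has finished syncing.
            if (int(info.get("connected_slaves", 0)) > 0
                    and info.get("master_sync_in_progress") == "0"):
                return True
        else:
            # Replica: require a healthy link to the master before leaving.
            if info.get("master_link_status") == "up":
                return True
        time.sleep(poll_seconds)
    return False


# A synced master is immediately safe to stop:
safe = wait_until_safe_to_stop(
    lambda: {"role": "master", "connected_slaves": "1",
             "master_sync_in_progress": "0"},
    poll_seconds=0.01, max_wait=1.0)
```

The timeout is a deliberate design choice: a PreStop hook that loops forever would only be cut short by terminationGracePeriodSeconds anyway, so bounding the wait keeps the shutdown behavior explicit.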
@wkd-woo wkd-woo added the enhancement New feature or request label May 13, 2024
@sapisuper

@wkd-woo Hi, any update regarding this enhancement?


wkd-woo commented Jun 29, 2024

> @wkd-woo Hi, any update regarding this enhancement?

@sapisuper No, the maintainers haven't given any feedback on this enhancement yet.
