
Improving Reliability of statefulset RollingUpdate with Container Lifecycle Hooks #923

Open
wkd-woo opened this issue May 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments


wkd-woo commented May 13, 2024

Is your feature request related to a problem? Please describe.
In scenarios that change the desired state of a StatefulSet, such as a Redis version upgrade, the StatefulSet's updateStrategy causes its Pods to undergo a RollingUpdate.

Assuming a 3-member replication setup, there is a risk of data loss if a Pod goes down momentarily before a synced replica is secured, because the operator does not reconcile during the RollingUpdate.

Therefore, during the RollingUpdate driven by the StatefulSet, it is crucial to ensure that at least one replica synchronized with the leader is secured.

One might think that setting the StatefulSet's terminationGracePeriodSeconds to a sufficiently long duration to delay the RollingUpdate would be adequate, but I believe using Container Lifecycle Hooks to functionally guarantee this would significantly enhance the project's reliability.

Describe the solution you'd like
Describe alternatives you've considered
I propose writing event code for the PreStop hook that checks whether a failover-capable replica is secured before the container terminates:

- If the Pod designated for deletion has the redis-role slave, it is safe to delete the Pod.
- If it is the master, wait until a fully synced replica is secured (`master_sync_in_progress == 0`):
  - If one is already secured, proceed with termination.
  - If a sync is in progress, stay in the loop until it completes.

```
127.0.0.1:6379> INFO REPLICATION
# Replication
role:slave
master_host:xxx.xxx.xxx.xxx
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
...
```
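The `INFO REPLICATION` output shown above is a simple `key:value` text format, so the PreStop hook only needs a small parser. A minimal sketch in Python (the helper name and parsing approach are my own illustration, not existing operator code):

```python
def parse_info_replication(raw: str) -> dict:
    """Parse `redis-cli INFO REPLICATION` output into a field -> value dict.

    Skips blank lines and section headers (lines starting with '#').
    """
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value.strip()
    return fields


sample = """# Replication
role:slave
master_host:10.0.0.1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
"""
info = parse_info_replication(sample)
```

With this, the hook can read `info["role"]`, `info["master_link_status"]`, and `info["master_sync_in_progress"]` directly when deciding whether termination is safe.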

I would like to hear what the maintainers think about this issue and the development of this feature.

If it's difficult for you to allocate time, I would like to add this feature myself and submit a Pull Request.

What version of redis-operator are you using?

redis-operator version:

Additional context
Here's the pseudo-code of the PreStop event code.

### Pseudo-Code

```
# re-read INFO REPLICATION on each loop iteration in a real implementation
infoReplication := redis-cli INFO REPLICATION

role                 := infoReplication[role]
masterSyncInProgress := infoReplication[master_sync_in_progress]
connectedSlaves      := infoReplication[connected_slaves]
masterLinkStatus     := infoReplication[master_link_status]

if role == "master":
    while not (connectedSlaves > 0 and masterSyncInProgress == 0):
        sleep(1)
    exit(0)
else if role == "slave":
    while masterLinkStatus != "up":
        sleep(1)
    exit(0)
```
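As a concrete illustration of that pseudo-code, here is a minimal Python sketch of the wait loop. The function name, the injectable `fetch_info` callback, and the `max_wait` timeout are assumptions of mine, not existing operator code; a real PreStop hook would re-run `redis-cli INFO REPLICATION` on each iteration:

```python
import time


def wait_until_safe_to_stop(fetch_info, poll_seconds=1.0, max_wait=300.0):
    """Block until this Redis node can terminate without risking data loss.

    fetch_info() must return the INFO REPLICATION fields as a dict of
    strings (hypothetical callback; in practice it would shell out to
    `redis-cli INFO REPLICATION` and parse the output).

    Returns True when safe, False if max_wait elapses first.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        info = fetch_info()
        if info.get("role") == "master":
            # Master: require at least one replica that has finished syncing.
            if (int(info.get("connected_slaves", 0)) > 0
                    and info.get("master_sync_in_progress") == "0"):
                return True
        else:
            # Replica: require a healthy link to the master before leaving.
            if info.get("master_link_status") == "up":
                return True
        time.sleep(poll_seconds)
    return False


# A synced master is immediately safe to stop:
safe = wait_until_safe_to_stop(
    lambda: {"role": "master", "connected_slaves": "1",
             "master_sync_in_progress": "0"},
    poll_seconds=0.01, max_wait=1.0)
```

The timeout is a deliberate design choice: a PreStop hook that loops forever would only be cut short by terminationGracePeriodSeconds anyway, so bounding the wait keeps the shutdown behavior explicit.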
@wkd-woo wkd-woo added the enhancement New feature or request label May 13, 2024
@sapisuper

@wkd-woo Hi, any update regarding this enhancement?


wkd-woo commented Jun 29, 2024

> @wkd-woo Hi, any update regarding this enhancement?

@sapisuper No, the maintainers haven't given any feedback on this enhancement yet.
