[Feature]: Graceful handling of WAL disk space exhaustion #4521
Closes #3775
Hi team, are we also planning to track disk statistics in the Cluster CR? As a consumer, I would like to track disk usage, warn my customers when the disk fills past a certain threshold, and then let the CloudNativePG operator shut down Postgres when the disk actually fills up.
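To make the threshold idea above concrete, here is a minimal Go sketch of how a consumer might classify disk usage. It is only an illustration: `DiskLevel`, `classifyDiskUsage`, and the `warnRatio` parameter are hypothetical names, not part of the CloudNativePG API, and the used/total byte counts are assumed to come from whatever metrics source you already have.

```go
package main

// DiskLevel is a hypothetical classification of how full a volume is.
type DiskLevel string

const (
	DiskOK      DiskLevel = "ok"
	DiskWarning DiskLevel = "warning"
	DiskFull    DiskLevel = "full"
	DiskUnknown DiskLevel = "unknown"
)

// classifyDiskUsage compares used/total against warnRatio (for example 0.8).
// It returns DiskUnknown when totalBytes is zero, so callers never divide by zero.
func classifyDiskUsage(usedBytes, totalBytes uint64, warnRatio float64) DiskLevel {
	if totalBytes == 0 {
		return DiskUnknown
	}
	if usedBytes >= totalBytes {
		return DiskFull
	}
	if float64(usedBytes)/float64(totalBytes) >= warnRatio {
		return DiskWarning
	}
	return DiskOK
}
```

A caller would warn customers on `DiskWarning` and leave the `DiskFull` case to the operator.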
leonardoce added a commit that referenced this issue, Jun 4, 2024

PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files. The operator did not recognize this condition and, since the primary failed, was performing a failover to the most advanced replica.

This action will not fix the underlying issue. Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.

This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator will not trigger a switchover and set a phase describing the situation. After the PVCs are resized, the cluster will restart working correctly.

Closes: #4521
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
dougkirkley pushed a commit to dougkirkley/cloudnative-pg that referenced this issue, Jun 11, 2024

…4404)

PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files. The operator did not recognize this condition and, since the primary failed, was performing a failover to the most advanced replica.

This action will not fix the underlying issue. Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.

This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator will not trigger a switchover and set a phase describing the situation. After the PVCs are resized, the cluster will restart working correctly.

Closes: cloudnative-pg#4521
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Signed-off-by: Douglass Kirkley <dkirkley@eitccorp.com>
Is there an existing issue already for this feature request/idea?
What problem is this feature going to solve? Why should it be added?
PostgreSQL will cleanly shut down when there's no space left for WAL files.
The operator will perceive that condition as a failure of the primary instance, leading to a failover.
Failing over will not help because every other replica will have the same error condition.
Describe the solution you'd like
Having the operator automatically fence the primary would help because it would prevent any automatic failover from happening and would give the user time to increase the WAL disk space, fixing the root cause of the issue.
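As a rough illustration of the behaviour requested here, the decision could look like the Go sketch below. Everything in it is hypothetical: `InstanceStatus`, `Action`, and `decideAction` are made-up names for this example and are not taken from the CloudNativePG codebase. It also assumes the instance manager can somehow report that PostgreSQL stopped because the WAL volume is full.

```go
package main

// InstanceStatus is a hypothetical summary of what the primary's
// instance manager reports to the operator.
type InstanceStatus struct {
	Healthy     bool // false when PostgreSQL is not running or not reachable
	WALDiskFull bool // true when the shutdown was caused by a full WAL volume
}

// Action is a hypothetical description of what the operator does next.
type Action string

const (
	ActionNone          Action = "none"
	ActionFailover      Action = "failover"
	ActionWaitForResize Action = "wait-for-disk-resize"
)

// decideAction checks the disk-full condition before the generic health
// check, so a primary that stopped for lack of WAL space never triggers
// a failover.
func decideAction(primary InstanceStatus) Action {
	switch {
	case primary.WALDiskFull:
		return ActionWaitForResize
	case !primary.Healthy:
		return ActionFailover
	default:
		return ActionNone
	}
}
```

The ordering of the cases is the point: the disk-full check has to win over the "primary is down" check, otherwise the operator would fail over to a replica that is likely to hit the same condition.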
Describe alternatives you've considered
Automatically scaling up the disk space would work, too, but I feel this is a different and broader topic.
Additional context
No response
Backport?
No
Are you willing to actively contribute to this feature?
Yes