[Feature]: Graceful handling of WAL disk space exhaustion #4521

leonardoce · 2024-05-13T10:01:42Z

Is there an existing issue already for this feature request/idea?

I have searched for an existing issue, and could not find anything. I believe this is a new feature request to be evaluated.

What problem is this feature going to solve? Why should it be added?

PostgreSQL will cleanly shut down when there's no space left for WAL files.
The operator will perceive that condition as a failure of the primary instance, leading to a failover.

Failing over will not help because every other replica will have the same error condition.

Describe the solution you'd like

Having the operator automatically fence the primary would help because it would prevent any automatic failover from happening and would give the user time to increase the WAL disk space, fixing the root cause of the issue.

Describe alternatives you've considered

Automatically scaling up the disk space would work too, but I feel this is a different and broader topic.

Additional context

No response

Backport?

No

Are you willing to actively contribute to this feature?

Yes

Code of Conduct

I agree to follow this project's Code of Conduct

gbartolini · 2024-05-25T06:19:06Z

Closes #3775

NiharDudam · 2024-05-29T01:35:23Z

Hi team, are we also planning to track disk statistics in the Cluster CR?

As a consumer, I would like to track disk usage and warn my customers when the disk fills past a certain threshold, and then let the CloudNativePG operator shut down PostgreSQL when the disk actually fills up.
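The threshold check described in this comment could look roughly like the following on the consumer side. This is a minimal sketch in Go, assuming the used and total byte counts come from some external monitoring source; nothing in this thread confirms that the Cluster resource exposes them. `DiskUsage`, `Level`, and `Classify` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// DiskUsage is a hypothetical snapshot of a volume's capacity, in bytes.
// How these numbers are collected is outside the scope of this sketch.
type DiskUsage struct {
	TotalBytes uint64
	UsedBytes  uint64
}

// UsedFraction returns the used share of the volume in the range [0, 1].
// A volume with unknown (zero) capacity is reported as empty.
func (d DiskUsage) UsedFraction() float64 {
	if d.TotalBytes == 0 {
		return 0
	}
	return float64(d.UsedBytes) / float64(d.TotalBytes)
}

// Level is a hypothetical severity for a disk usage reading.
type Level string

const (
	LevelOK      Level = "ok"
	LevelWarning Level = "warning"
	LevelFull    Level = "full"
)

// Classify maps a reading to a severity. warnAt is the fraction
// (for example 0.8) at which consumers should be warned.
func Classify(d DiskUsage, warnAt float64) Level {
	switch {
	case d.TotalBytes > 0 && d.UsedBytes >= d.TotalBytes:
		return LevelFull
	case d.UsedFraction() >= warnAt:
		return LevelWarning
	default:
		return LevelOK
	}
}

func main() {
	readings := []DiskUsage{
		{TotalBytes: 100, UsedBytes: 50},
		{TotalBytes: 100, UsedBytes: 90},
		{TotalBytes: 100, UsedBytes: 100},
	}
	for _, r := range readings {
		fmt.Printf("used=%d total=%d level=%s\n", r.UsedBytes, r.TotalBytes, Classify(r, 0.8))
	}
}
```

The warning threshold is left as a parameter because a suitable value depends on the workload's WAL generation rate.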

PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files. The operator did not recognize this condition and, since the primary failed, was performing a failover to the most advanced replica. This action will not fix the underlying issue. Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.

This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator will not trigger a switchover and set a phase describing the situation. After the PVCs are resized, the cluster will restart working correctly.

Closes: #4521

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>

…4404)

PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files. The operator did not recognize this condition and, since the primary failed, was performing a failover to the most advanced replica. This action will not fix the underlying issue. Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.

This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator will not trigger a switchover and set a phase describing the situation. After the PVCs are resized, the cluster will restart working correctly.

Closes: cloudnative-pg#4521

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Signed-off-by: Douglass Kirkley <dkirkley@eitccorp.com>

leonardoce added the triage Pending triage label May 13, 2024

leonardoce assigned gbartolini May 13, 2024

leonardoce mentioned this issue May 13, 2024

feat: prevent failovers when disk space is exhausted #4404

Merged

gbartolini modified the milestones: 1.24.0, 1.23.2 May 25, 2024

mnencia assigned leonardoce May 27, 2024

mnencia removed the triage Pending triage label May 27, 2024

mnencia unassigned gbartolini May 27, 2024

leonardoce closed this as completed in #4404 Jun 4, 2024

gbartolini modified the milestones: 1.23.2, 1.24.0 Jun 4, 2024
