
Improve recreation of instances #48

Merged: 16 commits merged into main from improve-recreation-of-instances on May 10, 2024

Conversation

santiago-ventura
Contributor

Overview

This pull request introduces a series of enhancements and new features for the Cloud Foundry™ Service Operator. Key improvements include exponential back-off, enhanced error handling, and comprehensive integration tests.

Motivation and Context

The changes are necessary to ensure higher stability and reliability of the operator under various conditions, especially in production environments. They improve how the CF Service Operator handles failures of Cloud Foundry operations, such as failed service instance creation and deletion.

Description of Changes

  • Exponential Back-Off: Introduced exponential back-off for service instance recreation, improving reconciliation reliability.
  • Retry Limits: Established a maximum number of retries for service instance recreation, controlled by an annotation, to prevent endless retry loops.
  • Reconcile Timeout: Added logic to make the time interval between reconciles configurable via annotations (annotation handling for the retry limit and reconcile interval is sketched after this list).
  • Test Suite:
    • Added integration tests using suite_test.
    • Introduced the counterfeiter library to mock CF API requests.
  • Integration Tests: Developed specific integration tests to validate different scenarios for the creation and recreation of CF service instances and spaces.
  • Makefile Changes:
    • CRD Generation: Ensured all necessary CRD files are automatically generated when make test-fast is executed.
    • Introduced a new make clean target that removes tool binaries whose versions do not match the expected tool versions.
    • Updated the controller-tools library to v0.14.0.
  • Linting Support: Implemented linting using ESLint to enforce code quality standards.
  • Documentation: Updated the operator documentation with detailed descriptions of the annotations for these new features.
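
The following is a minimal sketch of how such annotation-driven settings could be read from a resource; the annotation keys, defaults, and helper names are assumptions for illustration, not the operator's actual identifiers.

```go
// Sketch only: annotation keys, defaults, and helper names are assumed for
// illustration and do not mirror the operator's actual identifiers.
package main

import (
	"fmt"
	"math"
	"strconv"
	"time"
)

const (
	// Hypothetical annotation keys; the operator documentation lists the real ones.
	annotationMaxRetries       = "service-operator.cf.cs.sap.com/max-retries"
	annotationReconcileTimeout = "service-operator.cf.cs.sap.com/reconcile-timeout"
)

// maxRetries returns the retry cap from the annotations; a missing or invalid
// value falls back to "retry forever".
func maxRetries(annotations map[string]string) int {
	if v, ok := annotations[annotationMaxRetries]; ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return math.MaxInt32
}

// reconcileTimeout returns the configured reconcile interval, defaulting to
// one minute when the annotation is missing or unparsable.
func reconcileTimeout(annotations map[string]string) time.Duration {
	if v, ok := annotations[annotationReconcileTimeout]; ok {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return time.Minute
}

func main() {
	ann := map[string]string{
		annotationMaxRetries:       "10",
		annotationReconcileTimeout: "5m",
	}
	fmt.Println("max retries:", maxRetries(ann))
	fmt.Println("reconcile timeout:", reconcileTimeout(ann))
}
```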

How Has This Been Tested?

The changes have been tested on Kubernetes in development and staging environments.

Types of changes

  • New feature
  • Tests
  • Documentation update

Checklist

  • My code follows the code style of this project.
  • I have added tests to cover my changes.
  • All new tests passed.
  • I have updated the documentation accordingly.

Comment on lines +148 to +150
for i := 1; i <= 8; i++ {
	fakeSpaceClient.GetInstanceReturnsOnCall(i, kNoInstance, kNoError)
}
Contributor


Is it necessary to fake exactly calls 1 to 8, or can't we simplify it like this:

Suggested change
for i := 1; i <= 8; i++ {
	fakeSpaceClient.GetInstanceReturnsOnCall(i, kNoInstance, kNoError)
}
fakeSpaceClient.GetInstanceReturns(kNoInstance, kNoError)
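
(With counterfeiter, GetInstanceReturnsOnCall(i, …) stubs only the call with index i, while GetInstanceReturns(…) sets the default return values used by every call that has no per-call stub.)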

Contributor


Hi Uwe,

I have adopted the suggestions in 5f95d0c.
Please review. Thanks.

Contributor

@RalfHammer May 2, 2024


The original idea here was to let every call that is not expected fail.
The only exceptions were listed below:

  • call number 0 should return a failed instance
  • calls 1 to 8 should return an empty instance

The new version of the test code will now just return no instance without an error.
I would suggest reverting to the original code.

internal/controllers/suite_test.go (resolved conversation)
@TheBigElmo
Contributor

TheBigElmo commented Apr 29, 2024

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me?
If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

@santiago-ventura
Contributor Author

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi, we have included the exponential back-off logic in the HandleError function.
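
For illustration, a minimal sketch of that idea under a simplified signature; the constants, names, and delay values below are assumptions, not the operator's actual HandleError implementation:

```go
// Sketch only: illustrates exponential back-off via RequeueAfter; names and
// values are assumed and do not mirror the operator's actual HandleError.
package controllers

import (
	"math"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const (
	retryBaseDelay = 1 * time.Second
	retryMaxDelay  = 30 * time.Minute
)

// handleError requeues the resource with a delay that doubles per failed
// attempt, so transient Cloud Foundry errors are retried quickly at first
// and progressively less often afterwards.
func handleError(retryCount int, err error) (ctrl.Result, error) {
	if err == nil {
		return ctrl.Result{}, nil
	}
	delay := time.Duration(float64(retryBaseDelay) * math.Pow(2, float64(retryCount)))
	if delay > retryMaxDelay || delay <= 0 {
		delay = retryMaxDelay
	}
	// Returning a nil error together with RequeueAfter keeps controller-runtime's
	// own rate limiter from competing with this back-off schedule.
	return ctrl.Result{RequeueAfter: delay}, nil
}
```

Capping the delay is also how a maximum requeue time of 30 or 60 minutes would be expressed in this sketch: after enough failures the resource is simply retried at the capped interval.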

@santiago-ventura
Contributor Author

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi colleague,

I understand your idea of setting a maximum requeue time for failing resources, but a maximum-retry limit is common practice as well. Exponential back-off helps manage the load by increasing the delay between retries, which is useful for temporary failures. However, without a maximum retry limit, a failing resource might be retried continuously without resolution, especially if the error is serious and persistent.

The retry limit enables the operator to signal that recovery may be impossible after numerous attempts, such as when 10 retries fail to resolve an issue. This avoids wasting resources and helps identify problems that need manual resolution.

Without a maximum retry limit, errors could trap resources in endless retries, preventing them from ever reaching a final state.

@TheBigElmo
Contributor

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi colleague,

I understand your idea of setting a maximum requeue time for failing resources, but a maximum-retry limit is common practice as well. Exponential back-off helps manage the load by increasing the delay between retries, which is useful for temporary failures. However, without a maximum retry limit, a failing resource might be retried continuously without resolution, especially if the error is serious and persistent.

The retry limit enables the operator to signal that recovery may be impossible after numerous attempts, such as when 10 retries fail to resolve an issue. This avoids wasting resources and helps identify problems that need manual resolution.

Without a maximum retry limit, errors could trap resources in endless retries, preventing them from ever reaching a final state.

Hi @santiago-ventura,

In Kubernetes, maximum-retry logic isn't common. The problem is that such resources will never become ready without manual intervention. The other question is: why keep a failed instance at all? Either you expect the resource to become ready at some point, or it will never work, in which case the best practice is to clean up such a resource.
This leads to the following suggestion: if you still need such behavior, please disable it by default and make it possible to start the operator with an additional parameter that enables it and allows overriding the default maximum number of retries.

@bKiralyWdf
Contributor

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi colleague,
I understand your idea of setting a maximum requeue time for failing resources, but a maximum-retry limit is common practice as well. Exponential back-off helps manage the load by increasing the delay between retries, which is useful for temporary failures. However, without a maximum retry limit, a failing resource might be retried continuously without resolution, especially if the error is serious and persistent.
The retry limit enables the operator to signal that recovery may be impossible after numerous attempts, such as when 10 retries fail to resolve an issue. This avoids wasting resources and helps identify problems that need manual resolution.
Without a maximum retry limit, errors could trap resources in endless retries, preventing them from ever reaching a final state.

Hi @santiago-ventura,

In Kubernetes, maximum-retry logic isn't common. The problem is that such resources will never become ready without manual intervention. The other question is: why keep a failed instance at all? Either you expect the resource to become ready at some point, or it will never work, in which case the best practice is to clean up such a resource. This leads to the following suggestion: if you still need such behavior, please disable it by default and make it possible to start the operator with an additional parameter that enables it and allows overriding the default maximum number of retries.

I agree that we should keep the default behaviour and only cap retries if the consumer actively wants this (i.e. sets the appropriate annotation on the resource). I would not want to set this at the operator level, because some instances may need more retries than others (e.g. with a less reliable CF service). So I would say the default behaviour defined here https://github.com/SAP/cf-service-operator/pull/48/files#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R47, among others, should be changed to "infinite retries". @santiago-ventura, can you please make that change?

This is analogous to how Kubernetes handles Job backoff limits (https://kubernetes.io/docs/tasks/job/pod-failure-policy/, https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy).

@santiago-ventura
Contributor Author

Hi @TheBigElmo,

We followed your suggestion and kept the infinite-retries behavior as the default.
The maximum-retries cap is only enabled via an annotation.

serviceInstanceDefaultMaxRetries = math.MaxInt32 // infinite number of retries

Thanks

@TheBigElmo
Contributor

Hi @TheBigElmo,

We followed your suggestion and kept the infinite-retries behavior as the default. The maximum-retries cap is only enabled via an annotation.

serviceInstanceDefaultMaxRetries = math.MaxInt32 // infinite number of retries

Thanks

Can we set the default to -1, so we don't change the default behavior?

@bKiralyWdf
Contributor

Hi @TheBigElmo,
We followed your suggestion and kept the infinite-retries behavior as the default. The maximum-retries cap is only enabled via an annotation.

serviceInstanceDefaultMaxRetries = math.MaxInt32 // infinite number of retries

Thanks

Can we set the default to -1, so we don't change the default behavior?

That would be overwritten by this 14cb132#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R421

The current code keeps the default behaviour of infinite retries, because 14cb132#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R461 checks whether the retry limit was changed from the default. Functionally, this means that a value of math.MaxInt32 retries indefinitely, not just math.MaxInt32 times.
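
For illustration, a minimal sketch of that behaviour; the shouldGiveUp helper and its check are assumptions rather than the actual diff, while the constant mirrors the one quoted above:

```go
// Sketch only: shows why the sentinel default behaves as "no cap"; the
// shouldGiveUp helper is illustrative, not the operator's actual code.
package main

import (
	"fmt"
	"math"
)

const serviceInstanceDefaultMaxRetries = math.MaxInt32 // sentinel: no cap configured

// shouldGiveUp enforces the cap only when the consumer has overridden the
// default, so the default behaviour remains "retry indefinitely".
func shouldGiveUp(retryCount, maxRetries int) bool {
	if maxRetries == serviceInstanceDefaultMaxRetries {
		return false
	}
	return retryCount >= maxRetries
}

func main() {
	fmt.Println(shouldGiveUp(1000000, serviceInstanceDefaultMaxRetries)) // false: default, never give up
	fmt.Println(shouldGiveUp(11, 10))                                    // true: annotation cap of 10 exceeded
}
```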

@TheBigElmo TheBigElmo merged commit 07ce935 into main May 10, 2024
6 checks passed
@TheBigElmo TheBigElmo deleted the improve-recreation-of-instances branch May 10, 2024 10:58