
Improve recreation of instances #48

Merged: 16 commits merged into main from improve-recreation-of-instances on May 10, 2024

Conversation

santiago-ventura
Contributor

Overview

This pull request introduces a series of enhancements and new features for the Cloud Foundry™ Service Operator. Key improvements include exponential back-off, enhanced error handling, and comprehensive integration tests.

Motivation and Context

The changes are necessary to ensure higher stability and reliability of the operator under various conditions, especially in production environments. They improve how the CF Service Operator handles failures of Cloud Foundry operations, such as failed service instance creation and deletion.

Description of Changes

  • Exponential Back-Off: Introduced exponential back-off for service instance recreation, improving reconciliation reliability.
  • Retry Limits: Established a maximum number of retries for service instance recreation, controlled by an annotation, to prevent endless retry loops.
  • Reconcile Timeout: Added logic to make the time interval between reconciles configurable via annotations (annotation handling for the retry limit and reconcile interval is sketched after this list).
  • Test Suite:
    • Added integration tests using suite_test.
    • Introduced the counterfeiter library to mock CF API requests.
  • Integration Tests: Developed specific integration tests to validate different scenarios for the creation and recreation of CF service instances and spaces.
  • Makefile Changes:
    • CRD Generation: Ensured all necessary CRD files are automatically generated when make test-fast is executed.
    • Introduced a new make clean target that removes tool binaries whose versions do not match the expected tool versions.
    • Updated the controller-tools library to v0.14.0.
  • Linting Support: Implemented linting using ESLint to enforce code quality standards.
  • Documentation: Updated the operator documentation with detailed descriptions of the annotations for these new features.
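
The following is a minimal sketch of how such annotation-driven settings could be read from a resource; the annotation keys, defaults, and helper names are assumptions for illustration, not the operator's actual identifiers.

```go
// Sketch only: annotation keys, defaults, and helper names are assumed for
// illustration and do not mirror the operator's actual identifiers.
package main

import (
	"fmt"
	"math"
	"strconv"
	"time"
)

const (
	// Hypothetical annotation keys; the operator documentation lists the real ones.
	annotationMaxRetries       = "service-operator.cf.cs.sap.com/max-retries"
	annotationReconcileTimeout = "service-operator.cf.cs.sap.com/reconcile-timeout"
)

// maxRetries returns the retry cap from the annotations; a missing or invalid
// value falls back to "retry forever".
func maxRetries(annotations map[string]string) int {
	if v, ok := annotations[annotationMaxRetries]; ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return math.MaxInt32
}

// reconcileTimeout returns the configured reconcile interval, defaulting to
// one minute when the annotation is missing or unparsable.
func reconcileTimeout(annotations map[string]string) time.Duration {
	if v, ok := annotations[annotationReconcileTimeout]; ok {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return time.Minute
}

func main() {
	ann := map[string]string{
		annotationMaxRetries:       "10",
		annotationReconcileTimeout: "5m",
	}
	fmt.Println("max retries:", maxRetries(ann))
	fmt.Println("reconcile timeout:", reconcileTimeout(ann))
}
```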

How Has This Been Tested?

The changes have been tested on Kubernetes in development and staging environments.

Types of changes

  • New feature
  • Tests
  • Documentation update

Checklist

  • My code follows the code style of this project.
  • I have added tests to cover my changes.
  • All new tests passed.
  • I have updated the documentation accordingly.

Comment on lines +148 to +150
for i := 1; i <= 8; i++ {
	fakeSpaceClient.GetInstanceReturnsOnCall(i, kNoInstance, kNoError)
}
Contributor


Is it necessary to fake exactly calls 1 to 8, or can't we simplify it like this:

Suggested change
for i := 1; i <= 8; i++ {
	fakeSpaceClient.GetInstanceReturnsOnCall(i, kNoInstance, kNoError)
}
fakeSpaceClient.GetInstanceReturns(kNoInstance, kNoError)
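
(With counterfeiter, GetInstanceReturnsOnCall(i, …) stubs only the call with index i, while GetInstanceReturns(…) sets the default return values used by every call that has no per-call stub.)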

Contributor


Hi Uwe,

I have adopted the suggestions in 5f95d0c.
Please review. Thanks.

Contributor

@RalfHammer May 2, 2024


The original idea here was to let every call that is not expected fail.
The only exceptions were listed below:

  • call number 0 should return a failed instance
  • calls 1 to 8 should return an empty instance

The new version of the test code will now just return no instance without an error.
I would suggest reverting to the original code.

internal/controllers/suite_test.go (resolved conversation)
@TheBigElmo
Contributor

TheBigElmo commented Apr 29, 2024

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me?
If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

@santiago-ventura
Contributor Author

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi, we have included the exponential back-off logic in the HandleError function.
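
For illustration, a minimal sketch of that idea under a simplified signature; the constants, names, and delay values below are assumptions, not the operator's actual HandleError implementation:

```go
// Sketch only: illustrates exponential back-off via RequeueAfter; names and
// values are assumed and do not mirror the operator's actual HandleError.
package controllers

import (
	"math"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const (
	retryBaseDelay = 1 * time.Second
	retryMaxDelay  = 30 * time.Minute
)

// handleError requeues the resource with a delay that doubles per failed
// attempt, so transient Cloud Foundry errors are retried quickly at first
// and progressively less often afterwards.
func handleError(retryCount int, err error) (ctrl.Result, error) {
	if err == nil {
		return ctrl.Result{}, nil
	}
	delay := time.Duration(float64(retryBaseDelay) * math.Pow(2, float64(retryCount)))
	if delay > retryMaxDelay || delay <= 0 {
		delay = retryMaxDelay
	}
	// Returning a nil error together with RequeueAfter keeps controller-runtime's
	// own rate limiter from competing with this back-off schedule.
	return ctrl.Result{RequeueAfter: delay}, nil
}
```

Capping the delay is also how a maximum requeue time of 30 or 60 minutes would be expressed in this sketch: after enough failures the resource is simply retried at the capped interval.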

@santiago-ventura
Contributor Author

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi colleague,

I understand your idea of setting a maximum requeue time for failing resources, but a maximum-retry limit is common practice as well. Exponential back-off helps manage the load by increasing the delay between retries, which is useful for temporary failures. However, without a maximum retry limit, a failing resource might be retried continuously without resolution, especially if the error is serious and persistent.

The retry limit enables the operator to signal that recovery may be impossible after numerous attempts, such as when 10 retries fail to resolve an issue. This avoids wasting resources and helps identify problems that need manual resolution.

Without a maximum retry limit, errors could trap resources in endless retries, preventing them from ever reaching a final state.

@TheBigElmo
Contributor

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi colleague,

I understand your idea of setting a maximum requeue time for failing resources, but a maximum-retry limit is common practice as well. Exponential back-off helps manage the load by increasing the delay between retries, which is useful for temporary failures. However, without a maximum retry limit, a failing resource might be retried continuously without resolution, especially if the error is serious and persistent.

The retry limit enables the operator to signal that recovery may be impossible after numerous attempts, such as when 10 retries fail to resolve an issue. This avoids wasting resources and helps identify problems that need manual resolution.

Without a maximum retry limit, errors could trap resources in endless retries, preventing them from ever reaching a final state.

Hi @santiago-ventura,

In Kubernetes, maximum-retry logic isn't common. The problem is that such resources will never become ready without manual intervention. The other question is: why keep a failed instance at all? Either you expect the resource to become ready at some point, or it will never work, in which case the best practice is to clean up such a resource.
This leads to the following suggestion: if you still need such behavior, please disable it by default and make it possible to start the operator with an additional parameter that enables it and allows overriding the default maximum number of retries.

@bKiralyWdf
Contributor

I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me? If exponential back-off is achieved, it will be possible to set the maximum requeue time for failing resources to something like 30 or 60 minutes. This would eliminate the need for maximum-retry logic, and the controller would retain the model of eventual consistency for its resources.

Hi colleague,
I understand your idea of setting a maximum requeue time for failing resources, but a maximum-retry limit is common practice as well. Exponential back-off helps manage the load by increasing the delay between retries, which is useful for temporary failures. However, without a maximum retry limit, a failing resource might be retried continuously without resolution, especially if the error is serious and persistent.
The retry limit enables the operator to signal that recovery may be impossible after numerous attempts, such as when 10 retries fail to resolve an issue. This avoids wasting resources and helps identify problems that need manual resolution.
Without a maximum retry limit, errors could trap resources in endless retries, preventing them from ever reaching a final state.

Hi @santiago-ventura,

In Kubernetes, maximum-retry logic isn't common. The problem is that such resources will never become ready without manual intervention. The other question is: why keep a failed instance at all? Either you expect the resource to become ready at some point, or it will never work, in which case the best practice is to clean up such a resource. This leads to the following suggestion: if you still need such behavior, please disable it by default and make it possible to start the operator with an additional parameter that enables it and allows overriding the default maximum number of retries.

I agree that we should keep the default behaviour and only cap retries if the consumer actively wants this (i.e. sets the appropriate annotation on the resource). I would not want to set this at the operator level, because some instances may need more retries than others (e.g. with a less reliable CF service). So I would say the default behaviour defined here https://github.com/SAP/cf-service-operator/pull/48/files#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R47, among others, should be changed to "infinite retries". @santiago-ventura, can you please make that change?

This is analogous to how Kubernetes handles Job backoff limits (https://kubernetes.io/docs/tasks/job/pod-failure-policy/, https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy).

@santiago-ventura
Contributor Author

Hi @TheBigElmo,

We followed your suggestion and kept the infinite-retries behavior as the default.
The maximum-retries cap is only enabled via an annotation.

serviceInstanceDefaultMaxRetries = math.MaxInt32 // infinite number of retries

Thanks

@TheBigElmo
Contributor

Hi @TheBigElmo,

We followed your suggestion and kept the infinite-retries behavior as the default. The maximum-retries cap is only enabled via an annotation.

serviceInstanceDefaultMaxRetries = math.MaxInt32 // infinite number of retries

Thanks

Can we set the default to -1, so we don't change the default behavior?

@bKiralyWdf
Contributor

Hi @TheBigElmo,
We followed your suggestion and kept the infinite-retries behavior as the default. The maximum-retries cap is only enabled via an annotation.

serviceInstanceDefaultMaxRetries = math.MaxInt32 // infinite number of retries

Thanks

Can we set the default to -1, so we don't change the default behavior?

That would be overwritten by this 14cb132#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R421

The current code keeps the default behaviour of infinite retries, because 14cb132#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R461 checks whether the retry limit was changed from the default. Functionally, this means that a value of math.MaxInt32 retries indefinitely, not just math.MaxInt32 times.
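
For illustration, a minimal sketch of that behaviour; the shouldGiveUp helper and its check are assumptions rather than the actual diff, while the constant mirrors the one quoted above:

```go
// Sketch only: shows why the sentinel default behaves as "no cap"; the
// shouldGiveUp helper is illustrative, not the operator's actual code.
package main

import (
	"fmt"
	"math"
)

const serviceInstanceDefaultMaxRetries = math.MaxInt32 // sentinel: no cap configured

// shouldGiveUp enforces the cap only when the consumer has overridden the
// default, so the default behaviour remains "retry indefinitely".
func shouldGiveUp(retryCount, maxRetries int) bool {
	if maxRetries == serviceInstanceDefaultMaxRetries {
		return false
	}
	return retryCount >= maxRetries
}

func main() {
	fmt.Println(shouldGiveUp(1000000, serviceInstanceDefaultMaxRetries)) // false: default, never give up
	fmt.Println(shouldGiveUp(11, 10))                                    // true: annotation cap of 10 exceeded
}
```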

@TheBigElmo TheBigElmo merged commit 07ce935 into main May 10, 2024
6 checks passed
@TheBigElmo TheBigElmo deleted the improve-recreation-of-instances branch May 10, 2024 10:58