Improve recreation of instances #48
Conversation
```go
for i := 1; i <= 8; i++ {
	fakeSpaceClient.GetInstanceReturnsOnCall(i, kNoInstance, kNoError)
}
```
is it necessary to fake exactly calls 1-8, or can't we simplify it like
```diff
-for i := 1; i <= 8; i++ {
-	fakeSpaceClient.GetInstanceReturnsOnCall(i, kNoInstance, kNoError)
-}
+fakeSpaceClient.GetInstanceReturns(kNoInstance, kNoError)
```
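For context, counterfeiter-generated fakes resolve a call-indexed stub first and only then fall back to the blanket stub, so both can be combined. A minimal Go sketch, reusing the fakeSpaceClient, kNoInstance, and kNoError names from this test (the call-0 override is purely illustrative):

```go
// Blanket default: every call without a call-specific stub returns
// (kNoInstance, kNoError), no matter how often GetInstance is invoked.
fakeSpaceClient.GetInstanceReturns(kNoInstance, kNoError)

// Call-specific stubs still take precedence over the blanket default,
// so individual calls can keep special return values.
fakeSpaceClient.GetInstanceReturnsOnCall(0, kNoInstance, kNoError)
```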
Hi Uwe,
I have adopted the suggestions in 5f95d0c.
Please review. Thanks.
The original idea here was to let every call fail that is not expected. The only exceptions were:
- call number 0 should return the failed instance
- calls 1 to 8 should return an empty instance

The new version of the test code will now just return no instance without an error. I would suggest reverting to the original coding.
Co-authored-by: uwefreidank <49634966+uwefreidank@users.noreply.github.com>
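For reference, a sketch of how the original expectations described above could be expressed with the counterfeiter fake; kFailedInstance and errUnexpectedCall are placeholder names, the rest is taken from the test:

```go
// Default: let every call fail that is not explicitly expected.
fakeSpaceClient.GetInstanceReturns(nil, errUnexpectedCall)

// Exception 1: call number 0 returns the failed instance.
fakeSpaceClient.GetInstanceReturnsOnCall(0, kFailedInstance, kNoError)

// Exception 2: calls 1 to 8 return an empty instance.
for i := 1; i <= 8; i++ {
	fakeSpaceClient.GetInstanceReturnsOnCall(i, kNoInstance, kNoError)
}
```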
I couldn't find the logic behind the Exponential Back-Off feature. Could you please guide me?
Hi, we have included the logic for the exponential back-off in the HandleError function.
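For readers looking for the mechanics: the pattern is to grow the RequeueAfter delay with each consecutive failure. A minimal controller-runtime sketch, assuming a doubling delay with a cap; this illustrates the technique, it is not the actual HandleError body from this PR:

```go
package controllers // illustrative package name

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// backoffResult doubles the requeue delay with every consecutive failure
// and caps it at maxDelay, so a failing instance keeps being retried
// indefinitely, but with decreasing frequency.
func backoffResult(consecutiveFailures int) ctrl.Result {
	const (
		baseDelay = 1 * time.Second
		maxDelay  = 10 * time.Minute
	)
	delay := baseDelay << uint(consecutiveFailures) // 1s, 2s, 4s, 8s, ...
	if delay <= 0 || delay > maxDelay {             // <= 0 guards shift overflow
		delay = maxDelay
	}
	return ctrl.Result{RequeueAfter: delay}
}
```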
Hi colleague, I understand your idea of setting a maximum requeue time for failing resources, but a maximum retry limit is a common practice as well. Exponential back-off helps manage load by increasing the delay between retries, which is useful for temporary failures. However, without a maximum retry limit, a failing resource might be retried continuously without resolution, especially if the error is serious and persistent. A retry limit enables the operator to signal that recovery may be impossible after numerous attempts, such as when 10 retries fail to resolve an issue. This avoids wasting resources and helps identify problems needing manual resolution. Without a maximum retry limit, errors could trap resources in endless retries, preventing them from ever reaching a final state.
In Kubernetes, maximum retry logic isn't common. The problem is that resources will never become ready without manual interaction. The other question is: why should you keep a failed instance? Either you expect the resource to become ready at some point, or it won't ever work, and then the best practice is to clean up such a resource.
I agree that we should keep the default behaviour, and only cap retries if the consumer actively wants this (i.e. sets the appropriate annotation on the resource). I would not want to set this on operator level, because it is possible that some instances need more retries (e.g. a less reliable CF service). So I would say the default behaviour defined here https://github.com/SAP/cf-service-operator/pull/48/files#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R47 among others should be changed to "infinite retries". @santiago-ventura can you please make that change? This is analogous to how Kubernetes handles Job backoff limits (https://kubernetes.io/docs/tasks/job/pod-failure-policy/, https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy)
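A sketch of the opt-in semantics proposed here, under the assumption of a hypothetical annotation key and helper name (neither is taken from this PR): retries stay infinite unless the consumer explicitly sets a cap on the resource.

```go
package controllers // illustrative package name

import "strconv"

// Hypothetical annotation key for a per-resource retry cap.
const annotationMaxRetries = "service-operator.cf.cs.sap.com/max-retries"

// maxRetriesFor returns the per-resource retry cap, or -1 (infinite)
// when the annotation is absent or not a non-negative integer.
func maxRetriesFor(annotations map[string]string) int {
	raw, ok := annotations[annotationMaxRetries]
	if !ok {
		return -1 // default: infinite retries
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return -1 // invalid values fall back to the safe default
	}
	return n
}
```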
Hi @TheBigElmo, we followed your suggestion and kept the infinite retries behavior as the default.
Thanks
Can we set the default to -1, so we don't change the default behavior?
That would be overwritten by 14cb132#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R421. The current code keeps the default behaviour of infinite retries, because 14cb132#diff-10aec153ceb471e285d9e444db581f8efd500c63fecf207f114d1ab4b746b451R461 checks if the retry count was changed. This functionally means that a value of -1 keeps the infinite retries behaviour.
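In other words, a sketch with placeholder names (not the PR's actual code) of the sentinel check described above:

```go
const defaultMaxRetries = -1 // sentinel: cap was never changed

// retriesExhausted reports whether retrying should stop: never while the
// default is untouched, otherwise once the failure count reaches the cap.
func retriesExhausted(failures, maxRetries int) bool {
	if maxRetries == defaultMaxRetries {
		return false // -1 / unchanged default means infinite retries
	}
	return failures >= maxRetries
}
```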
Overview
This pull request introduces a series of enhancements and new features for the Cloud Foundry™ Service Operator. Key improvements include the implementation of exponential back-off, enhanced error handling, and comprehensive integration tests.
Motivation and Context
The changes are necessary to ensure higher stability and reliability of the operator under various conditions, especially in production environments. They enhance how the CF Service Operator handles failures of Cloud Foundry operations, such as service instance creation and deletion failures.
Description of Changes
- Comprehensive integration tests have been added in `suite_test`.
- CRD generation: all necessary CRD files are now automatically generated when `make test-fast` is executed.
- A `clean` target has been introduced to remove binary libraries that do not match the actual tool library versions.
- Update to `v0.14.0`.

How Has This Been Tested?
The changes have been tested on Kubernetes in development and staging environments.