Handle retries in SONIC #29503
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29503/14737
A new Pull Request was created by @kpedro88 (Kevin Pedro) for master. It involves the following packages: HeterogeneousCore/SonicCore @makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here
please test
The tests are being triggered in jenkins.
+1
Comparison job queued.
Failures can be tested with unit tests as well (it does pretty much require the intermediate script though), e.g.
I'm not saying such a test should be included though.
++tries_;
//if max retries has not been exceeded, call evaluate again
if (tries_ < allowedTries()) {
  evaluate();
What should happen if the client (caller) passes an exception?
My original thought was that allowing retries should mean allowing retries in all cases, i.e. even if an exception is passed from the client.
However, maybe it's worth adding an additional, optional parameter to denote whether an exception should override the remaining allowed retries, in case some exceptions are more critical than others. Alternatively, the success parameter could be changed from a bool to an enum with allowed states "done", "try again", "stop". What do you think?
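For illustration, the enum alternative could look roughly like the sketch below. This is not code from the PR: `FinishStatus`, `EnumRetryClient`, and `dispatch()` are hypothetical names, and the class is a standalone stand-in rather than the real `SonicClientBase`. Only `tries_`, `allowedTries`, `evaluate()`, and `finish()` echo names that appear in this conversation.

```cpp
#include <functional>
#include <utility>

// Hypothetical three-state status replacing the bool `success` argument.
enum class FinishStatus { Done, TryAgain, Stop };

// Standalone stand-in for a SONIC-like client (not the real SonicClientBase).
class EnumRetryClient {
public:
  // `attempt` plays the role of one remote call; it receives the number of
  // tries already used and reports how the call went.
  EnumRetryClient(unsigned allowedTries, std::function<FinishStatus(unsigned)> attempt)
      : allowedTries_(allowedTries), attempt_(std::move(attempt)) {}

  // Runs attempts until one reports Done or Stop, or the try budget is used up.
  // Returns true only if an attempt reported Done.
  bool dispatch() {
    tries_ = 0;
    succeeded_ = false;
    evaluate();
    return succeeded_;
  }

  unsigned tries() const { return tries_; }

private:
  void evaluate() { finish(attempt_(tries_)); }

  void finish(FinishStatus status) {
    ++tries_;
    switch (status) {
      case FinishStatus::Done:
        succeeded_ = true;
        return;
      case FinishStatus::Stop:
        // Give up immediately, regardless of remaining tries.
        return;
      case FinishStatus::TryAgain:
        // If max tries has not been exceeded, call evaluate again.
        if (tries_ < allowedTries_)
          evaluate();
        return;
    }
  }

  unsigned allowedTries_;
  unsigned tries_ = 0;
  bool succeeded_ = false;
  std::function<FinishStatus(unsigned)> attempt_;
};
```

With an enum, "stop" becomes an explicit state instead of being inferred from whether an exception was passed.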
After some further thought I find the proper behavior an interesting question.
On one hand, we try to keep exceptions to signify fatal errors that should stop the job. On the other hand, it could be useful to continue retrying in some cases, but when hitting the limit throw the exception from the client. How about using LogError/LogWarning for the cases where a retry makes sense, and making the passed exception always lead to propagating it to WaitingTaskWithArenaHolder immediately?
How about using LogError/LogWarning for the cases where a retry makes sense
To make sure we're on the same page: this would just entail noting in the documentation that any exception passed to finish() will immediately end the process*, so if a retry is desired, client developers should emit a message instead?
* and then modifying the code to enforce this
How about using LogError/LogWarning for the cases where a retry makes sense
To make sure we're on the same page: this would just entail noting in the documentation that any exception passed to finish() will immediately end the process*, so if a retry is desired, client developers should emit a message instead?
* and then modifying the code to enforce this
Yes (I think). Codewise here the finish() should first check if eptr is non-null, and if it is, call holder._doneWaiting(eptr) and return. The three states would be denoted by
- "done": success == true, no exception
- "try again": success == false, no exception (emit a message if useful)
- "stop": exception, value of success is irrelevant
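The three states above can be sketched as follows. This is a minimal standalone illustration, not the code merged in this PR: `FakeHolder` is a hypothetical stand-in for `WaitingTaskWithArenaHolder` that only records what it is given, and `RetrySketchClient`, `dispatch()`, and the error text are made up for the example. Raising an exception once the tries are exhausted follows the PR description.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for WaitingTaskWithArenaHolder: it only records
// the exception pointer handed to it.
struct FakeHolder {
  bool done = false;
  std::exception_ptr eptr;
  void doneWaiting(std::exception_ptr e) {
    done = true;
    eptr = e;
  }
};

// Standalone stand-in for a client with retries (not the real SonicClientBase).
class RetrySketchClient {
public:
  // `attempt` plays the role of one remote call; it must call finish() once.
  using Attempt = std::function<void(RetrySketchClient&, unsigned)>;

  RetrySketchClient(unsigned allowedTries, Attempt attempt)
      : allowedTries_(allowedTries), attempt_(std::move(attempt)) {}

  void dispatch(FakeHolder& holder) {
    holder_ = &holder;
    tries_ = 0;
    evaluate();
  }

  void finish(bool success, std::exception_ptr eptr = {}) {
    // "stop": a passed exception takes precedence over any remaining tries.
    if (eptr) {
      holder_->doneWaiting(eptr);
      return;
    }
    if (!success) {
      ++tries_;
      // "try again": if max tries has not been exceeded, call evaluate again.
      if (tries_ < allowedTries_) {
        evaluate();
        return;
      }
      // Tries exhausted: report the failure as an exception.
      eptr = std::make_exception_ptr(std::runtime_error("call failed after max allowed tries"));
    }
    // "done" (eptr is null) or exhausted (eptr is set).
    holder_->doneWaiting(eptr);
  }

  unsigned tries() const { return tries_; }

private:
  void evaluate() { attempt_(*this, tries_); }

  unsigned allowedTries_;
  unsigned tries_ = 0;
  FakeHolder* holder_ = nullptr;
  Attempt attempt_;
};
```

In this sketch a failed attempt that still has tries left never reaches the holder, so the framework only sees the final outcome.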
Comparison is ready Comparison Summary:
The code-checks are being triggered in jenkins.
@makortel I've updated the logic so exceptions take precedence. For now, there is no central function to convert exceptions to messages (this seems to depend on the type of exception). In the course of further development, if some common patterns emerge for doing this, they can be migrated into a central function of the SonicClientBase class.
please test
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29503/14786
The tests are being triggered in jenkins.
This comment is specifically for cases where the underlying RPC library throws an exception, right?
@makortel yes, in that case the client will have to catch the exception (in the case that it should not stop execution)
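As a rough sketch of what catching the exception on the client side might look like: the helper below wraps a single call and turns a recoverable failure into a message plus a retry request, while any other exception is kept for propagation. `TransientRpcError`, `AttemptResult`, and `guardedCall` are hypothetical names, which exception types count as recoverable is an assumption, and `std::cerr` stands in for LogWarning.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical exception type standing in for a recoverable transport failure
// thrown by an RPC library.
struct TransientRpcError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// The pair of values a client would then pass on to finish(success, eptr).
struct AttemptResult {
  bool success;
  std::exception_ptr eptr;
};

// Wraps one RPC call:
//  - no exception             -> "done"      (success, no eptr)
//  - TransientRpcError        -> "try again" (message emitted, no eptr)
//  - any other exception      -> "stop"      (eptr set)
AttemptResult guardedCall(const std::function<void()>& rpcCall) {
  try {
    rpcCall();
    return {true, nullptr};
  } catch (const TransientRpcError& e) {
    // std::cerr stands in for a LogWarning-style message here.
    std::cerr << "warning: " << e.what() << " (will retry if tries remain)\n";
    return {false, nullptr};
  } catch (...) {
    return {false, std::current_exception()};
  }
}
```

Since the conversion from exception to message depends on the exception type, each client would decide for itself which catch clauses belong in such a wrapper.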
+1
Comparison job queued.
Comparison is ready Comparison Summary:
+heterogeneous
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
In some cases, the connection to a remote server could be temporarily interrupted. It may be desirable to try the request again, rather than failing immediately. This PR adds that functionality to the SONIC core framework, in such a way that it can be enabled (or not) by specific clients. The documentation is also updated.
PR validation:
Separate unit tests are added for this feature.
The case where the number of failures exceeds the number of allowed tries is not included in the unit tests, as it emits an exception. It has been tested privately, and the expected output was observed: