Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle retries in SONIC #29503

Merged
merged 6 commits into from Apr 21, 2020
Merged

Handle retries in SONIC #29503

merged 6 commits into from Apr 21, 2020

Conversation

kpedro88
Copy link
Contributor

PR description:

In some cases, the connection a remote server could be temporarily interrupted. It may be desirable to try the request again, rather than failing immediately. This PR adds that functionality to the SONIC core framework, in such a way that it can be enabled (or not) by specific clients. The documentation is also updated.

PR validation:

Separate unit tests are added for this feature.

The case where the number of failures exceeds the number of allowed tries is not included in the unit tests, as it emits an exception. It has been tested privately, and the expected output was observed:

----- Begin Fatal Exception 10-Apr-2020 14:34:40 CDT-----------------------
An exception of category 'SonicCallFailed' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'p3'
   [2] Prefetching for module IntTestAnalyzer/'testerAsync'
   [3] Calling method for module SonicDummyProducerAsync/'dummyAsync'
Exception Message:
call failed after max 3 tries
----- End Fatal Exception -------------------------------------------------

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29503/14737

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @kpedro88 (Kevin Pedro) for master.

It involves the following packages:

HeterogeneousCore/SonicCore

@makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

@kpedro88
Copy link
Contributor Author

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 16, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5740/console Started: 2020/04/16 23:44

@cmsbuild
Copy link
Contributor

+1
Tested at: 50da7dc
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5740/summary.html
CMSSW: CMSSW_11_1_X_2020-04-16-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

Comparison job queued.

@makortel
Copy link
Contributor

The case where the number of failures exceeds the number of allowed tries is not included in the unit tests, as it emits an exception. It has been tested privately, and the expected output was observed:

Failures can be tested with unit tests as well (it does pretty much require the intermediate script though), e.g.

cmsRun -n ${NUMTHREADS} ${LOCAL_TEST_DIR}/${test}AliasOutput_cfg.py && die "cmsRun ${test}AliasOutput_cfg.py did not throw an exception" $?

I'm not saying such a test should be included though.

++tries_;
//if max retries has not been exceeded, call evaluate again
if (tries_ < allowedTries()) {
evaluate();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should happen if the client (caller) passes an exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original thought was that allowing retries should mean allowing retries in all cases, i.e. even if an exception is passed from the client.

However, maybe it's worth adding an additional, optional parameter to denote if an exception should override remaining allowed retries, in case some exceptions are more critical than others. Alternatively, the success parameter could be change from a bool to an enum with allowed states "done", "try again", "stop". What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some further thought I find the proper behavior interesting question.

On one hand, we try to keep exceptions to signify fatal errors that should stop the job. On the other hand, it could be useful to continue retrying in some cases, but when hitting the limit throw the exception from the client. How about using LogError/LogWarning for the cases where a retry makes sense, and making the passed exception always lead to propagating it to WaitingTaskWithArenaHolder immediately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using LogError/LogWarning for the cases where a retry makes sense

To make sure we're on the same page: this would just entail noting in the documentation that any exception passed to finish() will immediately end the process*, so if a retry is desired, client developers should emit a message instead?

  • and then modifying the code to enforce this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using LogError/LogWarning for the cases where a retry makes sense

To make sure we're on the same page: this would just entail noting in the documentation that any exception passed to finish() will immediately end the process*, so if a retry is desired, client developers should emit a message instead?

  • and then modifying the code to enforce this

Yes (I think). Codewise here the finish() should first check if eptr is non-null, and if it is, call holder._doneWaiting(eptr) and return. The three states would be denoted by

  • "done": success == true, no exception
  • "try again": success == false, no exception (emit a message if useful)
  • "stop": exception, value of success is irrelevant

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5740/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2696435
  • DQMHistoTests: Total failures: 26
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2696090
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@kpedro88
Copy link
Contributor Author

@makortel I've updated the logic so exceptions take precedence. For now, there is no central function to convert exceptions to messages (this seems to depend on the type of exception). In the course of further development, if some common patterns emerge for doing this, they can be migrated into a central function of the SonicClientBase class.

@kpedro88
Copy link
Contributor Author

please test

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29503/14786

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 20, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5788/console Started: 2020/04/20 18:57

@cmsbuild
Copy link
Contributor

Pull request #29503 was updated. @makortel, @cmsbuild, @fwyzard can you please check and sign again.

@makortel
Copy link
Contributor

For now, there is no central function to convert exceptions to messages (this seems to depend on the type of exception). In the course of further development, if some common patterns emerge for doing this, they can be migrated into a central function of the SonicClientBase class.

This comment is specifically for cases where the underlying RPC library throws an exception, right?

@kpedro88
Copy link
Contributor Author

@makortel yes, in that case the client will have to catch the exception (in the case that it should not stop execution)

@cmsbuild
Copy link
Contributor

+1
Tested at: 225fd52
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5788/summary.html
CMSSW: CMSSW_11_1_X_2020-04-20-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5788/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2696435
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2696114
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@makortel
Copy link
Contributor

+heterogeneous

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 5141b55 into cms-sw:master Apr 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants