Handle retries in SONIC #29503

kpedro88 · 2020-04-16T21:33:13Z

PR description:

In some cases, the connection to a remote server could be temporarily interrupted. It may be desirable to try the request again, rather than failing immediately. This PR adds that functionality to the SONIC core framework, in such a way that it can be enabled (or not) by specific clients. The documentation is also updated.

PR validation:

Separate unit tests are added for this feature.

The case where the number of failures exceeds the number of allowed tries is not included in the unit tests, as it emits an exception. It has been tested privately, and the expected output was observed:

----- Begin Fatal Exception 10-Apr-2020 14:34:40 CDT-----------------------
An exception of category 'SonicCallFailed' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'p3'
   [2] Prefetching for module IntTestAnalyzer/'testerAsync'
   [3] Calling method for module SonicDummyProducerAsync/'dummyAsync'
Exception Message:
call failed after max 3 tries
----- End Fatal Exception -------------------------------------------------

cmsbuild · 2020-04-16T21:33:40Z

The code-checks are being triggered in jenkins.

cmsbuild · 2020-04-16T21:41:27Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29503/14737

This PR adds an extra 16KB to repository

cmsbuild · 2020-04-16T21:41:51Z

A new Pull Request was created by @kpedro88 (Kevin Pedro) for master.

It involves the following packages:

HeterogeneousCore/SonicCore

@makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

kpedro88 · 2020-04-16T21:43:48Z

please test

cmsbuild · 2020-04-16T21:44:04Z

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5740/console Started: 2020/04/16 23:44

cmsbuild · 2020-04-16T22:54:18Z

+1
Tested at: 50da7dc
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5740/summary.html
CMSSW: CMSSW_11_1_X_2020-04-16-1100
SCRAM_ARCH: slc7_amd64_gcc820

cmsbuild · 2020-04-16T22:54:21Z

Comparison job queued.

makortel · 2020-04-16T23:13:00Z

The case where the number of failures exceeds the number of allowed tries is not included in the unit tests, as it emits an exception. It has been tested privately, and the expected output was observed:

Failures can be tested with unit tests as well (it does pretty much require the intermediate script though), e.g.

cmssw/FWCore/Integration/test/run_TestSwitchProducer.sh

Line 68 in 8404c99

cmsRun -n ${NUMTHREADS} ${LOCAL_TEST_DIR}/${test}AliasOutput_cfg.py && die "cmsRun ${test}AliasOutput_cfg.py did not throw an exception" $?

I'm not saying such a test should be included though.

makortel · 2020-04-16T23:18:02Z

HeterogeneousCore/SonicCore/interface/SonicClientBase.h

+      ++tries_;
+      //if max retries has not been exceeded, call evaluate again
+      if (tries_ < allowedTries()) {
+        evaluate();


What should happen if the client (caller) passes an exception?

My original thought was that allowing retries should mean allowing retries in all cases, i.e. even if an exception is passed from the client.

However, maybe it's worth adding an additional, optional parameter to denote whether an exception should override the remaining allowed retries, in case some exceptions are more critical than others. Alternatively, the success parameter could be changed from a bool to an enum with allowed states "done", "try again", "stop". What do you think?

After some further thought, I find the proper behavior an interesting question.

On one hand, we try to reserve exceptions for fatal errors that should stop the job. On the other hand, it could be useful to continue retrying in some cases and, upon hitting the limit, throw the exception from the client. How about using LogError/LogWarning for the cases where a retry makes sense, and making a passed exception always propagate to WaitingTaskWithArenaHolder immediately?

How about using LogError/LogWarning for the cases where a retry makes sense

To make sure we're on the same page: this would just entail noting in the documentation that any exception passed to finish() will immediately end the process*, so if a retry is desired, client developers should emit a message instead?

and then modifying the code to enforce this

Yes (I think). Code-wise, finish() should first check whether eptr is non-null and, if it is, call holder_.doneWaiting(eptr) and return. The three states would be denoted by:

"done": success == true, no exception

"try again": success == false, no exception (emit a message if useful)

"stop": exception, value of success is irrelevant

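The three states listed above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not the code merged in this PR: `FakeHolder` stands in for `WaitingTaskWithArenaHolder`, and `RetryingClient`, its constructor arguments, `dispatch()`, and `calls()` are hypothetical names introduced only for this example. The failure message mirrors the one shown in the PR validation output.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the framework's WaitingTaskWithArenaHolder.
struct FakeHolder {
  bool done = false;
  std::exception_ptr error;
  void doneWaiting(std::exception_ptr eptr) {
    done = true;
    error = eptr;
  }
};

// Hypothetical client illustrating the "done" / "try again" / "stop" logic.
class RetryingClient {
public:
  // succeedOnCall: the pretend server starts succeeding on this call number.
  RetryingClient(unsigned allowedTries, unsigned succeedOnCall)
      : allowedTries_(allowedTries), succeedOnCall_(succeedOnCall) {}

  void dispatch(FakeHolder& holder) {
    holder_ = &holder;
    tries_ = 0;
    calls_ = 0;
    evaluate();
  }

  unsigned calls() const { return calls_; }

private:
  // Pretend remote call: fails until the succeedOnCall_-th attempt.
  void evaluate() {
    ++calls_;
    finish(calls_ >= succeedOnCall_);
  }

  void finish(bool success, std::exception_ptr eptr = std::exception_ptr{}) {
    // "stop": a passed exception takes precedence over remaining retries.
    if (eptr) {
      holder_->doneWaiting(eptr);
      return;
    }
    // "try again": no exception, but the call did not succeed.
    if (!success) {
      ++tries_;
      if (tries_ < allowedTries_) {
        evaluate();
        return;
      }
      // Retries exhausted: report the failure as an exception.
      eptr = std::make_exception_ptr(
          std::runtime_error("call failed after max " + std::to_string(tries_) + " tries"));
    }
    // "done" (eptr is null) or retries exhausted (eptr is set).
    holder_->doneWaiting(eptr);
  }

  unsigned allowedTries_;
  unsigned succeedOnCall_;
  unsigned tries_ = 0;
  unsigned calls_ = 0;
  FakeHolder* holder_ = nullptr;
};
```

With `RetryingClient(3, 2)` the second call succeeds and the holder is released without an error; with a server that never recovers, the holder receives an exception after the third call.
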
cmsbuild · 2020-04-17T00:20:17Z

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5740/summary.html

Comparison Summary:

No significant changes to the logs found
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 34
DQMHistoTests: Total histograms compared: 2696435
DQMHistoTests: Total failures: 26
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 2696090
DQMHistoTests: Total skipped: 319
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
Checked 147 log files, 16 edm output root files, 34 DQM output files

cmsbuild · 2020-04-20T16:50:44Z

The code-checks are being triggered in jenkins.

kpedro88 · 2020-04-20T16:52:03Z

@makortel I've updated the logic so exceptions take precedence. For now, there is no central function to convert exceptions to messages (this seems to depend on the type of exception). In the course of further development, if some common patterns emerge for doing this, they can be migrated into a central function of the SonicClientBase class.

kpedro88 · 2020-04-20T16:52:12Z

please test

cmsbuild · 2020-04-20T16:55:15Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29503/14786

This PR adds an extra 20KB to repository

cmsbuild · 2020-04-20T16:55:39Z

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5788/console Started: 2020/04/20 18:57

cmsbuild · 2020-04-20T16:55:40Z

Pull request #29503 was updated. @makortel, @cmsbuild, @fwyzard can you please check and sign again.

makortel · 2020-04-20T17:51:43Z

For now, there is no central function to convert exceptions to messages (this seems to depend on the type of exception). In the course of further development, if some common patterns emerge for doing this, they can be migrated into a central function of the SonicClientBase class.

This comment is specifically for cases where the underlying RPC library throws an exception, right?

kpedro88 · 2020-04-20T17:53:17Z

@makortel yes, in that case the client will have to catch the exception (in the case that it should not stop execution)
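
A minimal sketch of what this could look like inside a client, assuming an RPC call that throws on a transient failure. `TransientRpcError`, `callServer`, `CallOutcome`, and `evaluateOnce` are invented for illustration and are not part of SONIC or of any RPC library; in a real client the warning would presumably go through the framework's message logger, and the outcome would be handed to `finish()`.

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>

// Hypothetical error type thrown by an RPC library on a recoverable failure.
struct TransientRpcError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// The pair of values a client would pass on to finish(success, eptr).
struct CallOutcome {
  bool success;
  std::exception_ptr eptr;
};

// Hypothetical remote call: mode 0 succeeds, mode 1 fails transiently,
// mode 2 fails in a way that should stop the job.
void callServer(int mode) {
  if (mode == 1)
    throw TransientRpcError("connection reset");
  if (mode == 2)
    throw std::logic_error("malformed request");
}

CallOutcome evaluateOnce(int mode) {
  try {
    callServer(mode);
    return {true, std::exception_ptr{}};
  } catch (const TransientRpcError& e) {
    // Recoverable: emit a message instead of passing the exception,
    // so that the base class may retry.
    std::cerr << "warning: " << e.what() << ", will retry if tries remain\n";
    return {false, std::exception_ptr{}};
  } catch (...) {
    // Anything else is treated as fatal and passed on unchanged.
    return {false, std::current_exception()};
  }
}
```

The design choice here is that only the client knows which exception types are recoverable, which is consistent with the remark above that there is no central conversion function for now.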

cmsbuild · 2020-04-20T18:08:02Z

+1
Tested at: 225fd52
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5788/summary.html
CMSSW: CMSSW_11_1_X_2020-04-20-1100
SCRAM_ARCH: slc7_amd64_gcc820

cmsbuild · 2020-04-20T18:08:05Z

Comparison job queued.

cmsbuild · 2020-04-20T19:35:02Z

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d416be/5788/summary.html

Comparison Summary:

No significant changes to the logs found
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 34
DQMHistoTests: Total histograms compared: 2696435
DQMHistoTests: Total failures: 2
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 2696114
DQMHistoTests: Total skipped: 319
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
Checked 147 log files, 16 edm output root files, 34 DQM output files

makortel · 2020-04-20T23:18:31Z

+heterogeneous

cmsbuild · 2020-04-20T23:18:55Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo (and backports should be raised in the release meeting by the corresponding L2)

silviodonato · 2020-04-21T08:17:38Z

+1

kpedro88 added 5 commits April 10, 2020 15:15

handle retries in client base class and test them

707343a

describe retries in readme

b55a939

apply "code formatting" to readme snippets

cec60ed

code format

e09881e

add separate retry tests

50da7dc

cmsbuild added this to the CMSSW_11_1_X milestone Apr 16, 2020

cmsbuild added code-checks-pending comparison-pending heterogeneous-pending orp-pending pending-signatures tests-pending labels Apr 16, 2020

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 16, 2020

cmsbuild added tests-started and removed tests-pending labels Apr 16, 2020

cmsbuild added tests-approved and removed tests-started labels Apr 16, 2020

makortel reviewed Apr 16, 2020

cmsbuild added comparison-available and removed comparison-pending labels Apr 17, 2020

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 20, 2020

cmsbuild added tests-started and removed tests-pending labels Apr 20, 2020

cmsbuild added tests-approved and removed tests-started labels Apr 20, 2020

cmsbuild added comparison-available and removed comparison-pending labels Apr 20, 2020

cmsbuild added fully-signed heterogeneous-approved and removed heterogeneous-pending pending-signatures labels Apr 20, 2020

cmsbuild added orp-approved and removed orp-pending labels Apr 21, 2020

cmsbuild merged commit 5141b55 into cms-sw:master Apr 21, 2020

kpedro88 mentioned this pull request Jun 19, 2020

Sonic client for Triton server #30318

Merged
