Fix race condition in EventProcessor #26289
Conversation
First, note that the vast majority of the time the correct code path wins the data race and there is no problem. That is why we have rarely seen failures related to this. The condition is related to the deletion of the LuminosityBlockProcessingStatus. If we want to call endRun after exiting processLumis, then this status object must have already been destroyed, because it holds a shared_ptr to the RunResources object that prevents endRun from running. If it isn't destroyed, then endRun may get called at some random time much later, which causes all kinds of problems; seg faults would be likely.
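As a rough sketch of the lifetime dependency described above (the class names below are simplified stand-ins for illustration, not the real CMSSW interfaces):

#include <iostream>
#include <memory>

// Stand-in for RunResources: its destruction is what allows endRun to proceed.
struct RunResources {
  ~RunResources() { std::cout << "resources released; endRun can now run\n"; }
};

// Stand-in for LuminosityBlockProcessingStatus: holds a shared_ptr that
// keeps the run resources (and therefore the block on endRun) alive.
struct LumiStatus {
  std::shared_ptr<RunResources> runResources;
};

int main() {
  auto resources = std::make_shared<RunResources>();
  {
    LumiStatus status{resources};
    resources.reset();  // our copy is gone, but status still owns one
    // While status is alive, ~RunResources cannot run, so endRun stays
    // blocked. In the bug, a status object that outlived lumi processing
    // delayed endRun to some arbitrary later time.
  }  // status destroyed here: last owner released, endRun unblocked on time
  std::cout << "processLumis exited; endRun runs at the intended point\n";
}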
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26289/8976
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages: FWCore/Framework. @cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
please test
The tests are being triggered in jenkins.
@@ -1458,23 +1478,22 @@ namespace edm {
    FDEBUG(1) << "\tdeleteRunFromCache " << run << "\n";
  }

  void EventProcessor::writeLumiAsync(WaitingTaskHolder task, std::shared_ptr<LuminosityBlockProcessingStatus> iStatus) {
What is stopping the LuminosityBlock from being deleted before the task created by writeLumiAsync finishes?
Couldn't the shared_ptr be destroyed, the LuminosityBlock be returned to the ReusableObjectHolder, and another lumi transition pick that object up and reset it while the writeLumi is happening?
Doesn't the LuminosityBlockProcessingStatus also prevent the LimitedTaskQueue from starting a new task? If the status object is deleted before writeLumi finishes, then another Lumi will start before we've finished with the one we are processing.
I see now, the task passed into writeLumiAsync is actually keeping the LuminosityBlockProcessingStatus alive. I withdraw my objection.
If you look in EventProcessor::globalEndLumiAsync at the first make_waiting_task (which is the one I did not modify), inside that task is the line "status.reset();". Until that point there is always a shared_ptr somewhere holding onto the status object and keeping it alive; part of the time it is in a vector held as a data member of EventProcessor. Many of the other shared_ptr copies were unnecessary for keeping it alive. Part of what I did was remove those unnecessary shared_ptr's, so I did not have to worry about making sure they got deleted soon enough.
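A minimal sketch of that ownership pattern (generic C++ with invented names, not the actual EventProcessor code): the task itself holds the shared_ptr, and status.reset() inside it marks the one well-defined point where the object may be destroyed.

#include <functional>
#include <memory>

struct LumiStatus { /* simplified stand-in */ };

std::function<void()> makeEndLumiTask(std::shared_ptr<LumiStatus> status) {
  // Move the shared_ptr into the capture so the task is the owner and no
  // stray copies lingering in callers can delay the destruction.
  return [status = std::move(status)]() mutable {
    // ... end-lumi work that needs the status object ...
    status.reset();  // the last shared_ptr is released exactly here
  };
}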
Note that one thing that complicates this is that the destruction order of captured variables is not specified by the C++ standard. The same is true for function arguments.
Also note that sometimes I used a move on the shared_ptr so that it was already null when it hit the end of a function or lambda. I had to make a few modifications so I could do the moves.
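For example, a minimal generic illustration of the move technique (not the actual framework code):

#include <cassert>
#include <memory>

int main() {
  auto sp = std::make_shared<int>(42);

  // Moving into the capture empties sp immediately, so we do not have to
  // rely on the (unspecified) destruction order of captures or arguments
  // at the end of the enclosing scope to release the object in time.
  auto task = [owned = std::move(sp)] { /* use *owned here */ };

  assert(sp == nullptr);  // sp is already null, before task even runs
  task();
}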
   //task to start the global begin lumi
   WaitingTaskHolder beginStreamsHolder{beginStreamsTask};
   auto const& es = esp_->eventSetup();
   {
     typedef OccurrenceTraits<LuminosityBlockPrincipal, BranchActionGlobalBegin> Traits;
     beginGlobalTransitionAsync<Traits>(beginStreamsHolder,
                                        *schedule_,
-                                       *(status->lumiPrincipal()),
+                                       lumiPrincipal,
This line is just a cleanup where an already existing local variable is used instead.
It is more than that. Because I moved status into beginStreamsTask, it is already null at this point.
Comparison job queued.
Comparison is ready. Comparison Summary:
Hold. I think I see a way to simplify this a little more, so I will update part of the fix. Also, one more piece of information to consider if we are thinking about backporting this: LuminosityBlockProcessingStatus did not exist before release series 10_1_X, so this problem could not exist before that. And the problem is in 10_1_X. But again, experience shows this is a very, very rare problem ...
Simplify the function globalEndLumiAsync so it is clearer to someone reading or maintaining the code who owns the lumi status object. The status object is now moved into the first task instead of copied. It is no longer moved into the second task, which never really needed the shared_ptr in the first place. I also edited the comments a bit and renamed a few things to make this part of the code clearer. This commit should not change behavior in any way.
The code-checks are being triggered in jenkins.
unhold
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26289/8992
please test
Pull request #26289 was updated. @cmsbuild, @smuzaffar, @Dr15Jones can you please check and sign again.
The tests are being triggered in jenkins.
Comparison job queued.
Comparison is ready. Comparison Summary:
+1
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
This fixes the problem from issue #26246, which is related to a data race in EventProcessor. Almost always, the correct code path wins the race and there are no problems; in these cases, there should be no differences in the output with or without this PR. But there are very rare cases where the wrong code path wins the data race; then endRun gets called late, which would most likely cause a seg fault or some other severe failure of an unpredictable nature.

The condition is related to the deletion of the LuminosityBlockProcessingStatus object. If we want to call endRun after exiting processLumis, then this status object must have already been destroyed, because it holds a shared_ptr to the RunResources object that prevents endRun from running. If it isn't destroyed, then endRun may get called at some random time much later, which causes serious problems.

The writeT task in EventProcessor::globalEndLumiAsync is the one that caused the failure in the issue, although there were several similar cases that this PR also fixes that could result in the same problem.
PR validation:
It is hard to write a test for this because it is a rare, non-reproducible problem. Even without the fix, the right code path almost always wins, and there is absolutely no difference in how things execute in that case.
If this PR is a backport please specify the original PR:
Potentially I could backport this. These failures are very rare, so it is a judgement call whether it is worth the effort. The problem was probably introduced when concurrent lumis were implemented.