Fix uploading EventSetup conditions from multiple CUDA streams #34725
When multiple CUDA streams try to initialise the same EventSetup object, the first one to do so starts the asynchronous operations, and the others are supposed to wait for it to finish. However, the code for recording the CUDA event was missing, so the other streams would find the default-constructed event, which is always "valid". Adding the missing call to record the event fixes the problem.
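The race described above can be illustrated with a small CUDA-free model. This is only a sketch under stated assumptions, not the CMSSW code touched by this PR: `FakeEvent`, `CachedProduct`, `acquire` and `demonstrate` are hypothetical names, and `FakeEvent` merely imitates the CUDA behaviour that an event which was never recorded reports itself as complete.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a CUDA event. Like a real event that was never
// passed to cudaEventRecord(), an unrecorded FakeEvent looks "complete".
struct FakeEvent {
  bool recorded = false;      // set by the (modelled) record call
  bool workFinished = false;  // set when the queued async work is done
  bool isComplete() const { return !recorded || workFinished; }
};

// Hypothetical cache entry for one EventSetup product on one device.
struct CachedProduct {
  std::mutex mutex;
  bool fillStarted = false;
  FakeEvent event;
};

enum class Action { StartedFill, MustWait, UseImmediately };

// Called once per stream that needs the product.
// recordEvent == false models the behaviour before the fix.
Action acquire(CachedProduct& product, bool recordEvent) {
  std::lock_guard<std::mutex> lock(product.mutex);
  if (!product.fillStarted) {
    product.fillStarted = true;  // the first stream queues the async copies
    if (recordEvent) {
      product.event.recorded = true;  // models the call that was missing
    }
    return Action::StartedFill;
  }
  // Later streams consult the event to decide whether they must wait.
  return product.event.isComplete() ? Action::UseImmediately : Action::MustWait;
}

// Usage: compare the second stream's decision without and with the record call.
void demonstrate() {
  CachedProduct buggy;
  acquire(buggy, false);
  // Without the record call, the second stream does not wait for the copy.
  assert(acquire(buggy, false) == Action::UseImmediately);

  CachedProduct fixed;
  acquire(fixed, true);
  // With the record call, the second stream waits until the work finishes.
  assert(acquire(fixed, true) == Action::MustWait);
  fixed.event.workFinished = true;
  assert(acquire(fixed, true) == Action::UseImmediately);
}
```

In the model, the only difference between the two cases is whether the first stream marks the event as recorded after queuing its work, which mirrors the one-call nature of the fix.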
type bugfix
please test
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34725/24367
A new Pull Request was created by @fwyzard (Andrea Bocci) for master. It involves the following packages:
@makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here
enable gpu
please test
+heterogeneous
This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy, @perrotta (and backports should be raised in the release meeting by the corresponding L2)
-1 Failed Tests: RelVals-INPUT
GPU Comparison Summary:
Comparison Summary:
+1
merge
PR validation:
Without this PR, running multiple jobs with few threads each crashes fairly soon:
With this PR, no crash is observed after a large number of tests: