Improve Various Patatrack kernels #35835
Conversation
spelling Co-authored-by: Slava Krutelyov <slava77@gmail.com>
enable gpu
@cmsbuild, please test
@@ -135,6 +135,11 @@ namespace gpuClustering {
#ifdef __CUDA_ARCH__
      // assume that we can cover the whole module with up to 16 blockDim.x-wide iterations
      constexpr int maxiter = 16;
      if (threadIdx.x == 0 && (hist.size() / blockDim.x) >= maxiter)
        printf("THIS IS NOT SUPPOSED TO HAPPEN too many hits in module %d: %d for block size %d\n",
there is an assert below but it's compiled away!
We should fix this!
Should we add __trap() here then?
When we have a Heterogeneous version, yes.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35835/26200
@cmsbuild please test to pick up cms-sw/cms-bot#1651 for reco comparisons
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ecaae3/19962/summary.html
GPU Comparison Summary:
Comparison Summary:
+reconstruction
+db
+heterogeneous @fwyzard said he could run timing measurements after signing.
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
@@ -107,9 +107,9 @@ __global__ void kernel_checkOverflows(HitContainer const *foundNtuplets,
          printf("Tracks overflow %d in %d\n", idx, thisCell.layerPairId());
        if (thisCell.isKilled())
          atomicAdd(&c.nKilledCells, 1);
-       if (thisCell.unused())
+       if (!thisCell.unused())
          atomicAdd(&c.nEmptyCells, 1);
Shouldn't nEmptyCells be renamed to nUsedCells?
I now count "used" cells because there are far fewer of them (so it's faster). Then in the report I take the complement, "1 -".
        }

      if (params_.doStats_) {
        // counters (add flag???)
        std::lock_guard guard(lock_stat);
Why do we need two separate scopes, instead of just one?
Historical reasons.
  // cuda atomics are NOT atomics on CPU so protect stat update with a mutex
  // waiting for a more general solution (including multiple devices) to be proposed and implemented
  std::mutex lock_stat;
In the CPU case, what is doing concurrent updates of the same stat?
Do you mean GPU? On the GPU, atomics on device memory are used. Multiple GPUs would crash.
On the CPU (without the mutex) it was obviously producing wrong (lower) results, as expected.
Maybe for the next iteration, just to make the code look less impenetrable.
      float vcal = float(adc[i]) * gain - pedestal * gain;
      if constexpr (isRun2) {
        float conversionFactor = id[i] < 96 ? VCaltoElectronGain_L1 : VCaltoElectronGain;
        float offset = id[i] < 96 ? VCaltoElectronOffset_L1 : VCaltoElectronOffset;
96 -> phase1PixelTopology::layerStart[1]?
There is an issue open to fix this everywhere (phase1PixelTopology::layerStart[1] may not compile, or may compile and produce wrong results).
Why would it not compile and/or produce wrong results? Can you point me to the issue? (I might have missed it.)
        float offset = id[i] < 96 ? VCaltoElectronOffset_L1 : VCaltoElectronOffset;
        vcal = vcal * conversionFactor + offset;
      }
      adc[i] = std::max(100, int(vcal));
@cms-sw/trk-dpg-l2 it would be nice if this 100 were taken from the same place as the other famous 100 in the regular clusterizer.
@VinInn did you run any checks of the performance or throughput with these changes?
Before the last commit, yes.
Simplify and improve the logic of various kernels.
Some of the math has been synced with the CPU version.
Fixed and improved the statistics collection and printing (off by default).
Technical.
No regression observed; the math has changed, so some regression cannot be excluded (even in CPU workflows).
Includes a bug fix for modules with large occupancy.
Supersedes #35598