
Improve Various Patatrack kernels #35835

Merged
merged 14 commits into cms-sw:master on Oct 27, 2021

Conversation

@VinInn (Contributor) commented Oct 25, 2021

Simplify and improve the logic of various kernels.
Some of the math has been synced with the CPU version.
Fixed and improved the statistics collection and printing (off by default).

Technical.
No regressions observed; since the math has changed, small regressions cannot be excluded (even in CPU workflows).

Includes a bug fix for modules with large occupancy.

Supersedes #35598

@VinInn (Contributor, Author) commented Oct 25, 2021

enable gpu

@VinInn (Contributor, Author) commented Oct 25, 2021

@cmsbuild, please test

@@ -135,6 +135,11 @@ namespace gpuClustering {
#ifdef __CUDA_ARCH__
// assume that we can cover the whole module with up to 16 blockDim.x-wide iterations
constexpr int maxiter = 16;
if (threadIdx.x == 0 && (hist.size() / blockDim.x) >= maxiter)
printf("THIS IS NOT SUPPOSED TO HAPPEN too many hits in module %d: %d for block size %d\n",
Contributor Author:

there is an assert below but it's compiled away!
We should fix this!

Contributor:

Should we add __trap() here then?

Contributor Author:

When we will have a heterogeneous version, yes.
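The concern in this thread (an `assert` that silently vanishes in release builds) can be illustrated with a minimal host-side sketch; the function name and the return-code convention are hypothetical, and in device code `__trap()` would abort the kernel instead of returning:

```cpp
#include <cassert>
#include <cstdio>

// Release builds typically compile with -DNDEBUG, so a plain assert(...)
// expands to nothing and the overflow check disappears; the printf is then
// the only surviving diagnostic. Hypothetical host-side sketch:
int clusterize(int nHits, int blockDimX, int maxIter) {
  if (nHits / blockDimX >= maxIter) {
    std::printf("too many hits: %d for block size %d\n", nHits, blockDimX);
    return -1;  // in device code one could call __trap() here instead
  }
  assert(nHits / blockDimX < maxIter);  // compiled away under NDEBUG
  return nHits;
}
```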

@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35835/26200

@slava77 (Contributor) commented Oct 26, 2021

@cmsbuild please test

to pick up cms-sw/cms-bot#1651 for reco comparisons

@cmsbuild

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ecaae3/19962/summary.html
COMMIT: 58fd079
CMSSW: CMSSW_12_1_X_2021-10-26-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35835/19962/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19782
  • DQMHistoTests: Total failures: 16
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19766
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 41
  • DQMHistoTests: Total histograms compared: 2797338
  • DQMHistoTests: Total failures: 5
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2797310
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 40 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 173 log files, 37 edm output root files, 41 DQM output files
  • TriggerResults: no differences found

@slava77 (Contributor) commented Oct 26, 2021

+reconstruction

for #35835 58fd079

  • code changes are essentially technical, in line with the PR description and the follow up review initially done in Improve various Patatrack Kernels #35598
  • jenkins tests pass and comparisons with the baseline show no relevant differences.
    • DQM differences appearing in 11634.506 are apparently at numerical precision level and appear just in various profile histogram fits

@ggovi (Contributor) commented Oct 27, 2021

+db

@makortel

+heterogeneous

@fwyzard said he could run timing measurements after signing.

@cmsbuild

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta

+1

@cmsbuild cmsbuild merged commit 69566d9 into cms-sw:master Oct 27, 2021
@@ -107,9 +107,9 @@ __global__ void kernel_checkOverflows(HitContainer const *foundNtuplets,
printf("Tracks overflow %d in %d\n", idx, thisCell.layerPairId());
if (thisCell.isKilled())
atomicAdd(&c.nKilledCells, 1);
if (thisCell.unused())
if (!thisCell.unused())
atomicAdd(&c.nEmptyCells, 1);
Contributor:

Shouldn't nEmptyCells be renamed to nUsedCells ?

Contributor Author:

I now count "used" cells because they are far fewer (so counting is faster). Then in the report I take the complement ("1 -").
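The rationale in the reply above can be sketched in host code: counting the rarer state ("used") means far fewer atomic increments than counting the common one ("empty"), and the empty fraction is recovered as the complement at report time. The names (`emptyFraction`, `cellUsed`) are hypothetical stand-ins for the kernel's counters:

```cpp
#include <atomic>
#include <vector>

// Hypothetical sketch: increment an atomic counter only for the rare
// "used" state (one atomicAdd per used cell on the device), then report
// the empty fraction as the complement.
double emptyFraction(const std::vector<bool>& cellUsed) {
  std::atomic<int> nUsedCells{0};
  for (bool used : cellUsed)      // in the kernel: one thread per cell
    if (used)
      nUsedCells.fetch_add(1);    // akin to atomicAdd(&c.nUsedCells, 1)
  return 1.0 - double(nUsedCells.load()) / double(cellUsed.size());
}
```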

Comment on lines +222 to +226
}

if (params_.doStats_) {
// counters (add flag???)
std::lock_guard guard(lock_stat);
Contributor:

Why do we need two separate scopes, instead of just one ?

Contributor Author:

historical

Comment on lines +6 to +8
// cuda atomics are NOT atomics on CPU so protect stat update with a mutex
// waiting for a more general solution (including multiple devices) to be proposed and implemented
std::mutex lock_stat;
Contributor:

In the CPU case, what is doing concurrent updates of the same stat?

Contributor Author:

Do you mean GPU?
On the GPU, atomics on device memory are used. Multiple GPUs will crash.

Contributor Author:

On CPU (without the mutex) it was obviously producing wrong (lower) results, as expected.

@mmusich (Contributor) left a comment:

Maybe for the next iterations, just to make the code look less impenetrable.

float vcal = float(adc[i]) * gain - pedestal * gain;
if constexpr (isRun2) {
float conversionFactor = id[i] < 96 ? VCaltoElectronGain_L1 : VCaltoElectronGain;
float offset = id[i] < 96 ? VCaltoElectronOffset_L1 : VCaltoElectronOffset;
Contributor:

96 -> phase1PixelTopology::layerStart[1] ?

Contributor Author:

There is an issue open to fix this everywhere (phase1PixelTopology::layerStart[1] may not compile, or may compile and produce wrong results).

Contributor:

Why would it not compile, or produce wrong results? Can you point me to the issue? (I might have missed it.)

Contributor:

For the record (thanks @VinInn): #35370
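The intent behind the suggestion in this thread can be sketched in plain C++: replace the magic number 96 with a named constexpr constant taken from the topology table. The table values here are assumptions used only for illustration, not the real CMSSW conditions:

```cpp
#include <cstdint>

namespace phase1PixelTopology {
  // assumption: illustrative first entries of the real layerStart table
  constexpr uint32_t layerStart[4] = {0, 96, 320, 672};
}

// Named boundary instead of a bare 96 sprinkled through the kernels.
constexpr uint32_t layer1End = phase1PixelTopology::layerStart[1];
static_assert(layer1End == 96, "layer-1 boundary");

bool isLayer1(uint16_t id) { return id < layer1End; }
```

On the host this is straightforward; the issue referenced above (#35370) concerns using such constexpr arrays from device code, where an ODR-use of the array can misbehave.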

float offset = id[i] < 96 ? VCaltoElectronOffset_L1 : VCaltoElectronOffset;
vcal = vcal * conversionFactor + offset;
}
adc[i] = std::max(100, int(vcal));
Contributor:

@cms-sw/trk-dpg-l2 it would be nice if this 100 were taken from the same place as the other famous 100 in the regular clusterizer.
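Putting the two hunks above together, the calibration step can be sketched as a self-contained host function. The VCal-to-electrons constants below are hypothetical stand-ins for the real conditions values, and the 100-electron floor is the constant the comment suggests centralizing:

```cpp
#include <algorithm>
#include <cstdint>

// Host-side sketch of the gain calibration in the snippet above.
// Layer-1 pixels (id < 96) use dedicated gain/offset constants in the
// Run 2 branch; the calibrated charge is floored at 100.
template <bool isRun2>
int calibrate(uint16_t adc, uint16_t id, float gain, float pedestal) {
  float vcal = float(adc) * gain - pedestal * gain;
  if constexpr (isRun2) {
    constexpr float VCaltoElectronGain = 47.f;        // assumption
    constexpr float VCaltoElectronGain_L1 = 50.f;     // assumption
    constexpr float VCaltoElectronOffset = -60.f;     // assumption
    constexpr float VCaltoElectronOffset_L1 = -670.f; // assumption
    float conversionFactor = id < 96 ? VCaltoElectronGain_L1 : VCaltoElectronGain;
    float offset = id < 96 ? VCaltoElectronOffset_L1 : VCaltoElectronOffset;
    vcal = vcal * conversionFactor + offset;
  }
  return std::max(100, int(vcal));  // floor small/negative charges at 100
}
```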

@fwyzard (Contributor) commented Oct 28, 2021

@VinInn did you run any checks of the performance or throughput with these changes ?

@VinInn (Contributor, Author) commented Oct 28, 2021

> @VinInn did you run any checks of the performance or throughput with these changes ?

Before the last commit, yes.


9 participants