
Excessive time spent in PrimaryVertexProducer in 200PU #17604

Closed
Dr15Jones opened this issue Feb 22, 2017 · 67 comments


@Dr15Jones
Contributor

Using a 200-pileup file, we are finding that the module labelled unsortedOfflinePrimaryVertices4D, which is of type PrimaryVertexProducer, can run for an excessively long time. Using CMSSW_9_0_X_2017-02-21-1100 on a KNL machine, the module ran for 4 hours before the job was killed.

This is a problem since we want to run this workflow on NERSC to test algorithmic performance, but this algorithm makes the jobs prohibitively long to run.

We were able to isolate the problem to this line
https://github.com/cms-sw/cmssw/blob/CMSSW_9_0_X/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc#L587

The problem is that it takes far too long for purge to ever return false, which is what is needed to break out of that while loop.
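
For readers without the file open, the control flow around that line is roughly the following (a heavily simplified, self-contained stand-in, not the actual CMSSW code): purge removes an insignificant vertex and reports whether it did, each pass is followed by the expensive re-equilibration (repeated update sweeps), and the loop only ends once purge finally returns false.

```cpp
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> vertices(12, 0.0);

  // stand-in for purge(): drop one "insignificant" vertex per call and
  // report whether anything was removed
  auto purge = [&vertices]() {
    if (vertices.size() <= 4) return false;
    vertices.pop_back();
    return true;
  };
  // stand-in for the expensive re-equilibration between purges
  // (repeated update() sweeps in the real code)
  auto reequilibrate = [] {};

  int passes = 0;
  while (purge()) {  // analogue of the while loop at DAClusterizerInZT.cc:587
    reequilibrate();
    ++passes;
  }
  std::printf("purge loop took %d passes\n", passes);
}
```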

@cmsbuild
Contributor

cmsbuild commented Feb 22, 2017

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

assign upgrade

@Dr15Jones
Contributor Author

@lgray

@cmsbuild
Contributor

New categories assigned: upgrade

@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor Author

@hufnagel

@Dr15Jones
Contributor Author

@gartung

@kpedro88
Contributor

@bendavid also

@lgray
Contributor

lgray commented Feb 22, 2017

Thanks. I suspect this might be due to numerical issues on the KNL compared to Xeon. Josh found a nice solution to those that we may want to try.

It should be cherry-pickable so we could test it very quickly.

@slava77
Contributor

slava77 commented Feb 22, 2017 via email

@Dr15Jones
Contributor Author

For reference, this discussion originated in #17556

@Dr15Jones
Contributor Author

I ran the same event on a Xeon system and the module completed in a few minutes. It does appear that the problem is related to a difference between the KNL and Xeon systems when running this algorithm. Numeric instability could definitely trigger such a problem.

@lgray @bendavid could either of you point us to the code modification that was mentioned?

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@slava77
Contributor

slava77 commented Feb 22, 2017 via email

@gartung
Member

gartung commented Feb 22, 2017

https://github.com/cms-sw/cmssw/blob/CMSSW_9_0_X/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc#L217

I noticed that you use exp in one case and std::exp in another case. The unscoped one will call the C exp function. Is that what was intended?
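
For reference, a minimal, self-contained illustration of the difference (not the CMSSW code): on a typical glibc/libstdc++ setup the unqualified call resolves to the C exp(double) from <math.h>, so a float argument is promoted to double, while std::exp from <cmath> has a float overload.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const float x = 0.1f;
  const double a = exp(x);      // unqualified: ::exp(double), x promoted to double
  const float b = std::exp(x);  // std::exp(float) overload from <cmath>
  std::printf("::exp    -> %.17g\nstd::exp -> %.17g\n", a, static_cast<double>(b));
}
```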

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@gartung
Member

gartung commented Feb 22, 2017

There are quite a few instances of unscoped exp function calls.

@gartung
Member

gartung commented Feb 22, 2017

CMSSW_9_0_X_2017-02-21-1100

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@Dr15Jones
Contributor Author

There are also a few other simple things that could speed up the calculation.

Move the following outside of the loop
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0100

Change the loop to be a range-based for, since you never use i for anything but tks[i].
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0097

Move rho0*exp(-beta*dzCutOff_*dzCutOff_) to outside the loop.
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0217

Break out of the loop if ++nUnique >= 2 since k0 will not change after that happens.
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0222

Making this more memory-access friendly would take more effort than these few changes.
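
A minimal sketch of the second and third suggestions (hoisting the loop-invariant exp term and switching to a range-based for); Track, tks, rho0, beta, and dzCutOff_ here are simplified stand-ins, not the actual DAClusterizerInZT code.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Track { double z = 0.0, pi = 1.0; };  // simplified stand-in

double sumExample(const std::vector<Track>& tks, double rho0, double beta, double dzCutOff_) {
  // loop-invariant: evaluate rho0*exp(-beta*dzCutOff_*dzCutOff_) once
  // instead of recomputing the exp() on every iteration
  const double z0term = rho0 * std::exp(-beta * dzCutOff_ * dzCutOff_);

  double sum = 0.0;
  // range-based for: the index was only ever used to subscript tks
  for (const Track& tk : tks) {
    sum += tk.pi * (std::exp(-beta * tk.z * tk.z) + z0term);
  }
  return sum;
}

int main() {
  const std::vector<Track> tks(5);
  std::printf("%g\n", sumExample(tks, 1.0, 0.5, 3.0));
}
```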

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017

Trying this with vdt::fast_exp in place of std::exp.
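
For concreteness, the substitution being tried looks roughly like this (a sketch assuming the VDT headers are available, as in a CMSSW environment; vdt::fast_exp trades a little accuracy for speed relative to std::exp):

```cpp
#include <cmath>
#include <cstdio>
#include "vdt/vdtMath.h"

int main() {
  const double x = -2.5;
  const double slow = std::exp(x);       // standard library
  const double fast = vdt::fast_exp(x);  // VDT fast approximation
  std::printf("std::exp      = %.17g\nvdt::fast_exp = %.17g\n", slow, fast);
}
```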

@gartung
Member

gartung commented Feb 23, 2017

Sadly, it made it worse:

TimeReport 97.496880 97.496880 97.496880 unsortedOfflinePrimaryVertices
TimeReport 21.750846 21.750846 21.750846 unsortedOfflinePrimaryVertices1D
TimeReport 5019.465432 5019.465432 5019.465432 unsortedOfflinePrimaryVertices4D
TimeReport 0.022433 0.022433 0.022433 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5018.98 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 127 vsize = 5018.98 deltaVsize = 0 rss = 4512.7 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 125 vsize = 5018.98 deltaVsize = 0 rss = 3914.08 delta = -598.621
[1] run: 1 lumi: 1 event: 127 vsize = 5018.98 deltaVsize = 0 rss = 4512.7 delta = 0
TimeReport> Time report complete in 10864.1 seconds
Time Summary:

  • Min event: 3357.1
  • Max event: 10032.8
  • Avg event: 6694.93
  • Total loop: 10860.7
  • Total job: 10864.1
    Event Throughput: 0.000184151 ev/s
    CPU Summary:
  • Total loop: 20811.2
  • Total job: 20814.5


@Dr15Jones
Contributor Author

For reference, on the Xeon system the module took 869 seconds to complete that event. Given the speed differences between the Xeon and KNL, I'd say the performance is roughly the same.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017

Xeon with #17622:

TimeReport 18.831715 18.831715 18.831715 unsortedOfflinePrimaryVertices
TimeReport 4.190681 4.190681 4.190681 unsortedOfflinePrimaryVertices1D
TimeReport 557.451650 557.451650 557.451650 unsortedOfflinePrimaryVertices4D
TimeReport 0.005501 0.005501 0.005501 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5065.57 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 125 vsize = 5065.57 deltaVsize = 0 rss = 4315.78 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 127 vsize = 5065.57 deltaVsize = 0 rss = 3501.36 delta = -814.422
[1] run: 1 lumi: 1 event: 125 vsize = 5065.57 deltaVsize = 0 rss = 4315.78 delta = 0
TimeReport> Time report complete in 1306.19 seconds
Time Summary:

  • Min event: 772.56
  • Max event: 1060.67
  • Avg event: 916.613
  • Total loop: 1305.11
  • Total job: 1306.19
    Event Throughput: 0.00153243 ev/s
    CPU Summary:
  • Total loop: 2056.97
  • Total job: 2057.87

@gartung
Member

gartung commented Feb 23, 2017

Using range-for statements where possible might improve #17622 slightly.

@Dr15Jones
Contributor Author

With #17622 on Xeon

TimeReport 557.451650 557.451650 557.451650 unsortedOfflinePrimaryVertices4D

With the old algorithm on Xeon

TimeReport 556.998504 556.998504 556.998504 unsortedOfflinePrimaryVertices4D

With Patrick's full changes (which include changing to range-based for loops)

TimeReport 0.884162 0.884162 0.884162 unsortedOfflinePrimaryVertices4D

So clearly the range-based for loops are an essential component.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@Dr15Jones
Contributor Author

The range-based for may make it much easier for the optimizer.

@gartung
Member

gartung commented Feb 23, 2017

I am double-checking the test with no modifications now.

It is possible that the compiler knows how to vectorize the range-for statement.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@Dr15Jones
Contributor Author

The loop clipping happened in purge, but all the debugging and profiling we did indicated that the algorithm was spending the vast majority of its time in update.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017

The clang-modernize extension can change for loops to range-for loops automatically when appropriate, i.e. when the index value is not referenced.
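
A hypothetical before/after of that automatic conversion (the same transform is available today as the clang-tidy check modernize-loop-convert):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<double> v{1.0, 2.0, 3.0};

  // convertible: the index i is only ever used to subscript v ...
  for (std::size_t i = 0; i < v.size(); ++i)
    std::printf("%f\n", v[i]);

  // ... so the tool rewrites it into a range-based for
  for (const double& x : v)
    std::printf("%f\n", x);

  // not convertible: i is used for something other than v[i]
  for (std::size_t i = 0; i < v.size(); ++i)
    std::printf("%zu: %f\n", i, v[i]);
}
```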

@gartung
Member

gartung commented Feb 24, 2017

Results are the same as before with your branch. I can only speculate that I created a vectorized for loop somewhere. I will look at the difference in the assembly generated by the compiler for your file and mine.

TimeReport 18.140637 18.140637 18.140637 unsortedOfflinePrimaryVertices
TimeReport 4.118666 4.118666 4.118666 unsortedOfflinePrimaryVertices1D
TimeReport 519.004210 519.004210 519.004210 unsortedOfflinePrimaryVertices4D
TimeReport 0.005757 0.005757 0.005757 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5017.57 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 125 vsize = 5017.57 deltaVsize = 0 rss = 4326.28 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 127 vsize = 5017.57 deltaVsize = 0 rss = 3463.34 delta = -862.934
[1] run: 1 lumi: 1 event: 125 vsize = 5017.57 deltaVsize = 0 rss = 4326.28 delta = 0
TimeReport> Time report complete in 1276.2 seconds
Time Summary:

  • Min event: 731.399
  • Max event: 1001.63
  • Avg event: 866.514
  • Total loop: 1275.09
  • Total job: 1276.2
    Event Throughput: 0.00156851 ev/s
    CPU Summary:
  • Total loop: 1963.84
  • Total job: 1964.95

@lgray
Contributor

lgray commented Feb 24, 2017 via email

@gartung
Member

gartung commented Feb 24, 2017

KNL 64/64 128 events
One event lasted 10 hours. Logs and stall graph on cmslpc at
/uscms_data/d2/gartung/132414.tev.fnal.gov

TimeReport 109.267564 109.267564 109.267564 unsortedOfflinePrimaryVertices
TimeReport 22.353710 22.353710 22.353710 unsortedOfflinePrimaryVertices1D
TimeReport 1181.681153 1181.681153 1181.681153 unsortedOfflinePrimaryVertices4D
TimeReport 0.086290 0.086290 0.086290 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

MemoryReport> Peak virtual size 116593 Mbytes
Key events increasing vsize:
[2] run: 1 lumi: 1 event: 17 vsize = 61521.3 deltaVsize = 1399.5 rss = 48639 delta = 1183.66
[3] run: 1 lumi: 1 event: 35 vsize = 64226.8 deltaVsize = 2705.5 rss = 50732.7 delta = 2093.63
[4] run: 1 lumi: 1 event: 63 vsize = 72491.3 deltaVsize = 8264.5 rss = 58224.6 delta = 7491.92
[12] run: 1 lumi: 1 event: 31 vsize = 82373.5 deltaVsize = 1910.25 rss = 62308.3 delta = 4083.73
[7] run: 1 lumi: 1 event: 49 vsize = 77393.3 deltaVsize = 1731.75 rss = 61224.6 delta = 3000.04
[122] run: 1 lumi: 1 event: 112 vsize = 116592 deltaVsize = 0.25 rss = 22231.7 delta = -5034.11
[128] run: 1 lumi: 1 event: 6 vsize = 116593 deltaVsize = 0 rss = 23259.9 delta = 5748.87
[127] run: 1 lumi: 1 event: 109 vsize = 116593 deltaVsize = 0.25 rss = 17511 delta = -4720.7
TimeReport> Time report complete in 30722.7 seconds
Time Summary:

  • Min event: 1300.36
  • Max event: 29809.4
  • Avg event: 3211.52
  • Total loop: 30703.7
  • Total job: 30722.7
    Event Throughput: 0.00416888 ev/s
    CPU Summary:
  • Total loop: 436942
  • Total job: 436961

StallMonitor> Module label # of stalls Total stalled time Max stalled time
StallMonitor> ------------ ----------- ------------------ ----------------
StallMonitor> AODSIMoutput 102

@gartung
Member

gartung commented Feb 24, 2017

Fixing the bug on my branch and running on Xeon, the performance is worse than the original.

TimeReport 18.888834 18.888834 18.888834 unsortedOfflinePrimaryVertices
TimeReport 4.351542 4.351542 4.351542 unsortedOfflinePrimaryVertices1D
TimeReport 1190.685604 1190.685604 1190.685604 unsortedOfflinePrimaryVertices4D
TimeReport 0.005536 0.005536 0.005536 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5066.08 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 127 vsize = 5066.08 deltaVsize = 0 rss = 4330.92 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 125 vsize = 5066.08 deltaVsize = 0 rss = 3298.78 delta = -1032.14
[1] run: 1 lumi: 1 event: 127 vsize = 5066.08 deltaVsize = 0 rss = 4330.92 delta = 0
TimeReport> Time report complete in 2512.3 seconds
Time Summary:

  • Min event: 803.909
  • Max event: 2275.05
  • Avg event: 1539.48
  • Total loop: 2511.4
  • Total job: 2512.3
    Event Throughput: 0.000796368 ev/s
    CPU Summary:
  • Total loop: 4761.77
  • Total job: 4762.68

@lgray
Contributor

lgray commented Feb 24, 2017 via email

@Dr15Jones
Contributor Author

I also ran variations of the implementation last night.

  1. changing just tks[i].dz2 * tks[i].dt2 to tks[i].dz2 + tks[i].dt2 made the two events run faster
    TimeReport 153.386689 153.386689 153.386689 unsortedOfflinePrimaryVertices4D

  2. Starting from Patrick's branch, I corrected the Zi reset and changed double zratio = k->pk*k->ei/Zi; to const double zratio = tk.pi*k->ei/Zi; (since that is the actual value used in two places). These changes made the job much slower:
    TimeReport 2404.347155 2404.347155 2404.347155 unsortedOfflinePrimaryVertices4D

I then compared the values in the output ROOT file for the offlinePrimaryVertices4D branch (since that branch uses as input the results from unsortedOfflinePrimaryVertices4D). The original, 1., and 2. versions of the algorithm all gave different results. I expected the original to be different (because of the change in combining dz2 and dt2), but mathematically 1. and 2. should be the same. I believe this is the 'numeric instability' Lindsey mentioned.
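
For illustration, here is a small self-contained example, unrelated to the CMSSW code, of how two mathematically equivalent expressions can already differ in the last bits of a double; in an iterative algorithm with hard cut-offs, that is enough to end up with a different set of vertices.

```cpp
#include <cstdio>

int main() {
  const double a = 0.1, b = 0.2, c = 0.3;

  const double lhs = (a + b) + c;  // one association order
  const double rhs = a + (b + c);  // the algebraically equal alternative

  std::printf("lhs = %.17g\nrhs = %.17g\nbitwise equal: %d\n", lhs, rhs, lhs == rhs);
}
```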

@Dr15Jones
Contributor Author

I think one of the problems is the behavior of update with respect to a large number of vertices. If one job has 100 vertices and another has 10,000, then, because the return value of update is the sum of the movements of all vertices, the job with 10,000 vertices is required to reach an average per-vertex delta 100x smaller than in the 100-vertex case!

That just compounds the problem: a job with a large number of vertices not only takes longer for each call to update, it also requires proportionally more calls to update, because the effective convergence requirement tightens linearly with the number of vertices.
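
As a rough, self-contained illustration of that scaling (Vertex, moveVertices(), and the threshold below are made up; this is not the CMSSW implementation): with a summed stopping criterion the number of sweeps grows with the vertex count, while an averaged criterion does not.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Vertex { double z; };

// stand-in for one annealing sweep: every vertex moves by half its position,
// and the function reports the movement of a single vertex
double moveVertices(std::vector<Vertex>& y) {
  for (auto& v : y) v.z *= 0.5;
  return y.front().z;  // old z - new z == new z here
}

int sweepsToConverge(std::size_t nVertices, bool averageDelta, double threshold = 1e-3) {
  std::vector<Vertex> y(nVertices, Vertex{1.0});
  int sweeps = 0;
  double delta = 0.0;
  do {
    const double perVertex = moveVertices(y);
    // summed criterion: delta scales with the number of vertices, so the loop
    // effectively demands a 1/N smaller average movement before it can stop
    delta = averageDelta ? perVertex : perVertex * y.size();
    ++sweeps;
  } while (delta > threshold);
  return sweeps;
}

int main() {
  std::printf("summed delta,    100 vertices: %d sweeps\n", sweepsToConverge(100, false));
  std::printf("summed delta,  10000 vertices: %d sweeps\n", sweepsToConverge(10000, false));
  std::printf("average delta, 10000 vertices: %d sweeps\n", sweepsToConverge(10000, true));
}
```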

@Dr15Jones
Contributor Author

As a test, I changed the return value of update from return delta; to return delta/y.size(). I did this on top of Patrick's branch. In that case the module ran ~3.5x faster:
TimeReport 696.010855 696.010855 696.010855 unsortedOfflinePrimaryVertices4D

The returned result had 8 additional vertices but the distributions looked to be pretty much the same.

@Dr15Jones
Contributor Author

As one last test, for all cases where there was a comparison update(...) > x, I changed it to update(...) > x/10 in the version where update returns the average delta. This was done to compensate for averaging rather than straight summing. In that case the speedup is only a factor of 2, but the algorithm finds the same number of vertices, although the distributions are a little different between this version and the fixed version of Patrick's branch.

TimeReport 1208.423231 1208.423231 1208.423231 unsortedOfflinePrimaryVertices4D

@kpedro88
Contributor

+1
(fully) resolved by #20709

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@Dr15Jones
Contributor Author

resolved
