
Excessive time spent in PrimaryVertexProducer in 200PU #17604

Closed
Dr15Jones opened this issue Feb 22, 2017 · 67 comments


@Dr15Jones
Contributor

Using a 200-pileup file, we are finding that the module labelled unsortedOfflinePrimaryVertices4D, which is of type PrimaryVertexProducer, can run for an excessively long time. Using CMSSW_9_0_X_2017-02-21-1100 on a KNL machine, the module ran for 4 hours before the job was killed.

This is a problem since we want to run this workflow on NERSC to test algorithmic performance, but this algorithm makes the jobs prohibitively long to run.

We were able to isolate the problem to this line
https://github.com/cms-sw/cmssw/blob/CMSSW_9_0_X/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc#L587

The problem is that it takes far too long for purge to ever return false, which is what is needed to break out of that while loop.
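
For readers without the file open, the control flow around that line is roughly the following (a heavily simplified, self-contained stand-in, not the actual CMSSW code): purge removes an insignificant vertex and reports whether it did, each pass is followed by the expensive re-equilibration (repeated update sweeps), and the loop only ends once purge finally returns false.

```cpp
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> vertices(12, 0.0);

  // stand-in for purge(): drop one "insignificant" vertex per call and
  // report whether anything was removed
  auto purge = [&vertices]() {
    if (vertices.size() <= 4) return false;
    vertices.pop_back();
    return true;
  };
  // stand-in for the expensive re-equilibration between purges
  // (repeated update() sweeps in the real code)
  auto reequilibrate = [] {};

  int passes = 0;
  while (purge()) {  // analogue of the while loop at DAClusterizerInZT.cc:587
    reequilibrate();
    ++passes;
  }
  std::printf("purge loop took %d passes\n", passes);
}
```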

@cmsbuild
Contributor

cmsbuild commented Feb 22, 2017

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

assign upgrade

@Dr15Jones
Contributor Author

@lgray

@cmsbuild
Contributor

New categories assigned: upgrade

@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor Author

@hufnagel

@Dr15Jones
Contributor Author

@gartung

@kpedro88
Contributor

@bendavid also

@lgray
Contributor

lgray commented Feb 22, 2017

Thanks. I suspect this might be due to numerical issues on the KNL compared to Xeon. Josh found a nice solution to those that we may want to try.

It should be cherry-pickable so we could test it very quickly.

@slava77
Contributor

slava77 commented Feb 22, 2017 via email

@Dr15Jones
Contributor Author

For reference, this discussion originated in #17556

@Dr15Jones
Contributor Author

I ran the same event on a Xeon system and the module completed in a few minutes. It does appear that the problem is related to a difference between the KNL and Xeon systems when running this algorithm. Numeric instability could definitely trigger such a problem.

@lgray @bendavid could either of you point us to the code modification that was mentioned?

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@slava77
Contributor

slava77 commented Feb 22, 2017 via email

@gartung
Member

gartung commented Feb 22, 2017

https://github.com/cms-sw/cmssw/blob/CMSSW_9_0_X/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc#L217

I noticed that you use exp in one case and std::exp in another case. The unscoped one will call the C exp function. Is that what was intended?
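
For reference, a minimal, self-contained illustration of the difference (not the CMSSW code): on a typical glibc/libstdc++ setup the unqualified call resolves to the C exp(double) from <math.h>, so a float argument is promoted to double, while std::exp from <cmath> has a float overload.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const float x = 0.1f;
  const double a = exp(x);      // unqualified: ::exp(double), x promoted to double
  const float b = std::exp(x);  // std::exp(float) overload from <cmath>
  std::printf("::exp    -> %.17g\nstd::exp -> %.17g\n", a, static_cast<double>(b));
}
```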

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@gartung
Member

gartung commented Feb 22, 2017

There are quite a few instances of unscoped exp function calls.

@gartung
Member

gartung commented Feb 22, 2017

CMSSW_9_0_X_2017-02-21-1100

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@Dr15Jones
Contributor Author

There are also a few other simple things that could speed up the calculation.

Move the following outside of the loop
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0100

Change the loop to be a range-based for, since you never use i for anything but tks[i].
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0097

Move rho0*exp(-beta*dzCutOff_*dzCutOff_) to outside the loop.
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0217

Break out of the loop if ++nUnique >= 2 since k0 will not change after that happens.
https://cmssdt.cern.ch/lxr/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_X_2017-02-22-0000#0222

Making this more memory-access friendly would take more effort than these few changes.
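
A minimal sketch of the second and third suggestions (hoisting the loop-invariant exp term and switching to a range-based for); Track, tks, rho0, beta, and dzCutOff_ here are simplified stand-ins, not the actual DAClusterizerInZT code.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Track { double z = 0.0, pi = 1.0; };  // simplified stand-in

double sumExample(const std::vector<Track>& tks, double rho0, double beta, double dzCutOff_) {
  // loop-invariant: evaluate rho0*exp(-beta*dzCutOff_*dzCutOff_) once
  // instead of recomputing the exp() on every iteration
  const double z0term = rho0 * std::exp(-beta * dzCutOff_ * dzCutOff_);

  double sum = 0.0;
  // range-based for: the index was only ever used to subscript tks
  for (const Track& tk : tks) {
    sum += tk.pi * (std::exp(-beta * tk.z * tk.z) + z0term);
  }
  return sum;
}

int main() {
  const std::vector<Track> tks(5);
  std::printf("%g\n", sumExample(tks, 1.0, 0.5, 3.0));
}
```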

@lgray
Contributor

lgray commented Feb 22, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017

Trying this with vdt::fast_exp in place of std::exp.
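
For concreteness, the substitution being tried looks roughly like this (a sketch assuming the VDT headers are available, as in a CMSSW environment; vdt::fast_exp trades a little accuracy for speed relative to std::exp):

```cpp
#include <cmath>
#include <cstdio>
#include "vdt/vdtMath.h"

int main() {
  const double x = -2.5;
  const double slow = std::exp(x);       // standard library
  const double fast = vdt::fast_exp(x);  // VDT fast approximation
  std::printf("std::exp      = %.17g\nvdt::fast_exp = %.17g\n", slow, fast);
}
```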

@gartung
Member

gartung commented Feb 23, 2017

Sadly, it made it worse:

TimeReport 97.496880 97.496880 97.496880 unsortedOfflinePrimaryVertices
TimeReport 21.750846 21.750846 21.750846 unsortedOfflinePrimaryVertices1D
TimeReport 5019.465432 5019.465432 5019.465432 unsortedOfflinePrimaryVertices4D
TimeReport 0.022433 0.022433 0.022433 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5018.98 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 127 vsize = 5018.98 deltaVsize = 0 rss = 4512.7 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 125 vsize = 5018.98 deltaVsize = 0 rss = 3914.08 delta = -598.621
[1] run: 1 lumi: 1 event: 127 vsize = 5018.98 deltaVsize = 0 rss = 4512.7 delta = 0
TimeReport> Time report complete in 10864.1 seconds
Time Summary:

  • Min event: 3357.1
  • Max event: 10032.8
  • Avg event: 6694.93
  • Total loop: 10860.7
  • Total job: 10864.1
    Event Throughput: 0.000184151 ev/s
    CPU Summary:
  • Total loop: 20811.2
  • Total job: 20814.5


@Dr15Jones
Contributor Author

For reference, on the Xeon system the module took 869 seconds to complete that event. Given the speed differences between the Xeon and KNL, I'd say the performance is roughly the same.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017

Xeon with #17622:

TimeReport 18.831715 18.831715 18.831715 unsortedOfflinePrimaryVertices
TimeReport 4.190681 4.190681 4.190681 unsortedOfflinePrimaryVertices1D
TimeReport 557.451650 557.451650 557.451650 unsortedOfflinePrimaryVertices4D
TimeReport 0.005501 0.005501 0.005501 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5065.57 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 125 vsize = 5065.57 deltaVsize = 0 rss = 4315.78 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 127 vsize = 5065.57 deltaVsize = 0 rss = 3501.36 delta = -814.422
[1] run: 1 lumi: 1 event: 125 vsize = 5065.57 deltaVsize = 0 rss = 4315.78 delta = 0
TimeReport> Time report complete in 1306.19 seconds
Time Summary:

  • Min event: 772.56
  • Max event: 1060.67
  • Avg event: 916.613
  • Total loop: 1305.11
  • Total job: 1306.19
    Event Throughput: 0.00153243 ev/s
    CPU Summary:
  • Total loop: 2056.97
  • Total job: 2057.87

@gartung
Member

gartung commented Feb 23, 2017

Using range-for statements where possible might improve #17622 slightly.

@Dr15Jones
Contributor Author

With #17622 on Xeon

TimeReport 557.451650 557.451650 557.451650 unsortedOfflinePrimaryVertices4D

With the old algorithm on Xeon

TimeReport 556.998504 556.998504 556.998504 unsortedOfflinePrimaryVertices4D

With Patrick's full changes (which include changing to range-based for loops)

TimeReport 0.884162 0.884162 0.884162 unsortedOfflinePrimaryVertices4D

So clearly the range-based for loops are an essential component.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@Dr15Jones
Contributor Author

The range-based for may make it much easier for the optimizer.

@gartung
Member

gartung commented Feb 23, 2017

I am double-checking the test with no modifications now.

It is possible that the compiler knows how to vectorize the range-for statement.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@Dr15Jones
Contributor Author

The loop clipping happened in purge, but all the debugging and profiling we did indicated that the algorithm was spending the vast majority of its time in update.

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@lgray
Contributor

lgray commented Feb 23, 2017 via email

@gartung
Member

gartung commented Feb 23, 2017

The clang-modernize extension can change for loops to range-for loops automatically when appropriate, i.e. when the index value is not referenced.
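
A hypothetical before/after of that automatic conversion (the same transform is available today as the clang-tidy check modernize-loop-convert):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<double> v{1.0, 2.0, 3.0};

  // convertible: the index i is only ever used to subscript v ...
  for (std::size_t i = 0; i < v.size(); ++i)
    std::printf("%f\n", v[i]);

  // ... so the tool rewrites it into a range-based for
  for (const double& x : v)
    std::printf("%f\n", x);

  // not convertible: i is used for something other than v[i]
  for (std::size_t i = 0; i < v.size(); ++i)
    std::printf("%zu: %f\n", i, v[i]);
}
```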

@gartung
Member

gartung commented Feb 24, 2017

Results are the same as before with your branch. I can only speculate that I created a vectorized for loop somewhere. I will look at the difference in the assembly generated by the compiler for your file and mine.

TimeReport 18.140637 18.140637 18.140637 unsortedOfflinePrimaryVertices
TimeReport 4.118666 4.118666 4.118666 unsortedOfflinePrimaryVertices1D
TimeReport 519.004210 519.004210 519.004210 unsortedOfflinePrimaryVertices4D
TimeReport 0.005757 0.005757 0.005757 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5017.57 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 125 vsize = 5017.57 deltaVsize = 0 rss = 4326.28 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 127 vsize = 5017.57 deltaVsize = 0 rss = 3463.34 delta = -862.934
[1] run: 1 lumi: 1 event: 125 vsize = 5017.57 deltaVsize = 0 rss = 4326.28 delta = 0
TimeReport> Time report complete in 1276.2 seconds
Time Summary:

  • Min event: 731.399
  • Max event: 1001.63
  • Avg event: 866.514
  • Total loop: 1275.09
  • Total job: 1276.2
    Event Throughput: 0.00156851 ev/s
    CPU Summary:
  • Total loop: 1963.84
  • Total job: 1964.95

@lgray
Contributor

lgray commented Feb 24, 2017 via email

@gartung
Member

gartung commented Feb 24, 2017

KNL 64/64 128 events
One event lasted 10 hours. Logs and stall graph on cmslpc at
/uscms_data/d2/gartung/132414.tev.fnal.gov

TimeReport 109.267564 109.267564 109.267564 unsortedOfflinePrimaryVertices
TimeReport 22.353710 22.353710 22.353710 unsortedOfflinePrimaryVertices1D
TimeReport 1181.681153 1181.681153 1181.681153 unsortedOfflinePrimaryVertices4D
TimeReport 0.086290 0.086290 0.086290 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

MemoryReport> Peak virtual size 116593 Mbytes
Key events increasing vsize:
[2] run: 1 lumi: 1 event: 17 vsize = 61521.3 deltaVsize = 1399.5 rss = 48639 delta = 1183.66
[3] run: 1 lumi: 1 event: 35 vsize = 64226.8 deltaVsize = 2705.5 rss = 50732.7 delta = 2093.63
[4] run: 1 lumi: 1 event: 63 vsize = 72491.3 deltaVsize = 8264.5 rss = 58224.6 delta = 7491.92
[12] run: 1 lumi: 1 event: 31 vsize = 82373.5 deltaVsize = 1910.25 rss = 62308.3 delta = 4083.73
[7] run: 1 lumi: 1 event: 49 vsize = 77393.3 deltaVsize = 1731.75 rss = 61224.6 delta = 3000.04
[122] run: 1 lumi: 1 event: 112 vsize = 116592 deltaVsize = 0.25 rss = 22231.7 delta = -5034.11
[128] run: 1 lumi: 1 event: 6 vsize = 116593 deltaVsize = 0 rss = 23259.9 delta = 5748.87
[127] run: 1 lumi: 1 event: 109 vsize = 116593 deltaVsize = 0.25 rss = 17511 delta = -4720.7
TimeReport> Time report complete in 30722.7 seconds
Time Summary:

  • Min event: 1300.36
  • Max event: 29809.4
  • Avg event: 3211.52
  • Total loop: 30703.7
  • Total job: 30722.7
    Event Throughput: 0.00416888 ev/s
    CPU Summary:
  • Total loop: 436942
  • Total job: 436961

StallMonitor> Module label # of stalls Total stalled time Max stalled time
StallMonitor> ------------ ----------- ------------------ ----------------
StallMonitor> AODSIMoutput 102

@gartung
Member

gartung commented Feb 24, 2017

Fixing the bug on my branch and running on Xeon, the performance is worse than the original.

TimeReport 18.888834 18.888834 18.888834 unsortedOfflinePrimaryVertices
TimeReport 4.351542 4.351542 4.351542 unsortedOfflinePrimaryVertices1D
TimeReport 1190.685604 1190.685604 1190.685604 unsortedOfflinePrimaryVertices4D
TimeReport 0.005536 0.005536 0.005536 vertexMerger
TimeReport 0.000000 0.000000 0.000000 zdcreco
TimeReport per event per exec per visit Name

T---Report end!

++ finished: end job
MemoryReport> Peak virtual size 5066.08 Mbytes
Key events increasing vsize:
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[1] run: 1 lumi: 1 event: 127 vsize = 5066.08 deltaVsize = 0 rss = 4330.92 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[2] run: 1 lumi: 1 event: 125 vsize = 5066.08 deltaVsize = 0 rss = 3298.78 delta = -1032.14
[1] run: 1 lumi: 1 event: 127 vsize = 5066.08 deltaVsize = 0 rss = 4330.92 delta = 0
TimeReport> Time report complete in 2512.3 seconds
Time Summary:

  • Min event: 803.909
  • Max event: 2275.05
  • Avg event: 1539.48
  • Total loop: 2511.4
  • Total job: 2512.3
    Event Throughput: 0.000796368 ev/s
    CPU Summary:
  • Total loop: 4761.77
  • Total job: 4762.68

@lgray
Contributor

lgray commented Feb 24, 2017 via email

@Dr15Jones
Contributor Author

I also ran variations of the implementation last night.

  1. changing just tks[i].dz2 * tks[i].dt2 to tks[i].dz2 + tks[i].dt2 made the two events run faster
    TimeReport 153.386689 153.386689 153.386689 unsortedOfflinePrimaryVertices4D

  2. Starting from Patrick's branch, I corrected the Zi reset and changed double zratio = k->pk*k->ei/Zi; to const double zratio = tk.pi*k->ei/Zi; (since that is the actual value used in two places). These changes made the job much slower:
    TimeReport 2404.347155 2404.347155 2404.347155 unsortedOfflinePrimaryVertices4D

I then compared the values in the output ROOT file for the offlinePrimaryVertices4D branch (since that branch uses as input the results from unsortedOfflinePrimaryVertices4D). The original, 1., and 2. versions of the algorithm all gave different results. I expected the original to be different (because of the change in combining dz2 and dt2), but mathematically 1. and 2. should be the same. I believe this is the 'numeric instability' Lindsey mentioned.
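
For illustration, here is a small self-contained example, unrelated to the CMSSW code, of how two mathematically equivalent expressions can already differ in the last bits of a double; in an iterative algorithm with hard cut-offs, that is enough to end up with a different set of vertices.

```cpp
#include <cstdio>

int main() {
  const double a = 0.1, b = 0.2, c = 0.3;

  const double lhs = (a + b) + c;  // one association order
  const double rhs = a + (b + c);  // the algebraically equal alternative

  std::printf("lhs = %.17g\nrhs = %.17g\nbitwise equal: %d\n", lhs, rhs, lhs == rhs);
}
```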

@Dr15Jones
Contributor Author

I think one of the problems is the behavior of update with respect to a large number of vertices. If one job has 100 vertices and another has 10,000, then, because the return value of update is the sum of the movements of all vertices, the job with 10,000 vertices is required to reach an average per-vertex delta 100x smaller than in the 100-vertex case!

That just compounds the problem: a job with a large number of vertices not only takes longer for each call to update, it also requires proportionally more calls to update, because the effective convergence requirement tightens linearly with the number of vertices.
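
As a rough, self-contained illustration of that scaling (Vertex, moveVertices(), and the threshold below are made up; this is not the CMSSW implementation): with a summed stopping criterion the number of sweeps grows with the vertex count, while an averaged criterion does not.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Vertex { double z; };

// stand-in for one annealing sweep: every vertex moves by half its position,
// and the function reports the movement of a single vertex
double moveVertices(std::vector<Vertex>& y) {
  for (auto& v : y) v.z *= 0.5;
  return y.front().z;  // old z - new z == new z here
}

int sweepsToConverge(std::size_t nVertices, bool averageDelta, double threshold = 1e-3) {
  std::vector<Vertex> y(nVertices, Vertex{1.0});
  int sweeps = 0;
  double delta = 0.0;
  do {
    const double perVertex = moveVertices(y);
    // summed criterion: delta scales with the number of vertices, so the loop
    // effectively demands a 1/N smaller average movement before it can stop
    delta = averageDelta ? perVertex : perVertex * y.size();
    ++sweeps;
  } while (delta > threshold);
  return sweeps;
}

int main() {
  std::printf("summed delta,    100 vertices: %d sweeps\n", sweepsToConverge(100, false));
  std::printf("summed delta,  10000 vertices: %d sweeps\n", sweepsToConverge(10000, false));
  std::printf("average delta, 10000 vertices: %d sweeps\n", sweepsToConverge(10000, true));
}
```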

@Dr15Jones
Contributor Author

As a test, I changed the return value of update from return delta; to return delta/y.size(). I did this on top of Patrick's branch. In that case the module ran ~3.5x faster:
TimeReport 696.010855 696.010855 696.010855 unsortedOfflinePrimaryVertices4D

The returned result had 8 additional vertices but the distributions looked to be pretty much the same.

@Dr15Jones
Contributor Author

As one last test, for all cases where there was a comparison update(...) > x, I changed it to update(...) > x/10 in the version where update returns the average delta. This was done to compensate for averaging rather than straight summing. In that case the speedup is only a factor of 2, but the algorithm finds the same number of vertices, although the distributions are a little different between this version and the fixed version of Patrick's branch.

TimeReport 1208.423231 1208.423231 1208.423231 unsortedOfflinePrimaryVertices4D

@kpedro88
Contributor

+1
(fully) resolved by #20709

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@Dr15Jones
Contributor Author

resolved
