
Refresh the index map of pointers to hits and tracks in Pandora translator #8015

Merged
merged 2 commits into cms-sw:CMSSW_6_2_X_SLHC on Mar 2, 2015

Conversation

@lgray (Contributor) commented Mar 2, 2015

Fixes an issue where the index map was not reset on each iteration, eventually causing a crash from either a bad read or excessive memory use.

@vandreev11 @pfs @kpedro88 @sethzenz
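
For readers unfamiliar with the code, here is a minimal sketch of the pattern being fixed, assuming the translator keeps a member map from reconstruction objects to their Pandora counterparts. The type and member names (RecHit, PandoraHit, TranslatorSketch, m_hitIndex) are invented for illustration and are not the actual RecoParticleFlow/PandoraTranslator code.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the reconstruction and Pandora-side objects;
// these are NOT the actual CMSSW / PandoraTranslator types.
struct RecHit {};
struct PandoraHit { const RecHit* source; };

class TranslatorSketch {
public:
  // Called once per event, analogous to an EDProducer's produce().
  void produce(const std::vector<RecHit>& hits) {
    // The fix, in spirit: reset the index map at each event.  Without this,
    // entries from earlier events accumulate and point at objects that no
    // longer exist, eventually causing a bad read or steadily growing memory.
    m_hitIndex.clear();

    m_translated.clear();
    m_translated.reserve(hits.size());
    for (const auto& hit : hits) {
      m_translated.push_back(PandoraHit{&hit});
      m_hitIndex.emplace(&hit, &m_translated.back());
    }
    // ... later steps use m_hitIndex to navigate from Pandora objects
    // back to the original hits/tracks ...
  }

private:
  std::vector<PandoraHit> m_translated;                       // per-event products
  std::unordered_map<const RecHit*, PandoraHit*> m_hitIndex;  // per-event index map
};
```

Clearing the map once per event is what stops the stale lookups; where exactly the clear() lives only affects how long the memory stays resident, which is what the comments below discuss.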

@lgray (Contributor, Author) commented Mar 2, 2015

@vandreev11 @pfs @kpedro88 @sethzenz -- just to make sure

@cmsbuild (Contributor) commented Mar 2, 2015

A new Pull Request was created by @lgray (Lindsey Gray) for CMSSW_6_2_X_SLHC.

Refresh the index map of pointers to hits and tracks in Pandora translator

It involves the following packages:

RecoParticleFlow/PandoraTranslator

The following packages do not have a category yet:

RecoParticleFlow/PandoraTranslator

@cmsbuild, @nclopezo can you please review it and eventually sign? Thanks.
@mmarionncern, @bachtis this is something you requested to watch as well.
You can sign off by replying to this message with '+1' in the first line of your reply.
You can reject by replying to this message with '-1' in the first line of your reply.
If you are an L2 or a release manager, you can ask for tests by saying 'please test' in the first line of a comment.
@fratnikov, @mark-grimes you are the release managers for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

@mark-grimes

Please test

@cmsbuild (Contributor) commented Mar 2, 2015

The tests are being triggered in Jenkins.

@lgray (Contributor, Author) commented Mar 2, 2015

@mark-grimes I'm probably going to move the .clear() to the end of produce() so the memory is released for the rest of the current event.
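
A short sketch of what that change would look like, again with invented names rather than the real translator code: clearing at the end of produce() frees the map's entries before the remaining modules process the same event, instead of holding them until the next event.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins, not the actual translator types.
struct RecHit {};
struct PandoraHit {};

class TranslatorSketch {
public:
  void produce(const std::vector<RecHit>& hits) {
    for (const auto& hit : hits) {
      m_hitIndex.emplace(&hit, nullptr);  // placeholder for the translated object
    }
    // ... use m_hitIndex while building this event's products ...

    // Clearing at the end of produce() releases the map's entries before
    // downstream modules run on the same event, rather than keeping them
    // allocated until the translator sees the next event.
    m_hitIndex.clear();
  }

private:
  std::unordered_map<const RecHit*, PandoraHit*> m_hitIndex;
};
```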

@Dr15Jones (Contributor)

If you are dealing with a map, then you really are better off putting it on the stack since you will not be amortizing any memory allocations.
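
A sketch of the alternative suggested here, under the assumption that the map is only needed inside produce(); the names are again invented. A function-local map is destroyed automatically when the function returns, so nothing needs to be cleared and nothing is retained between events.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins, not the actual translator types.
struct RecHit {};
struct PandoraHit {};

void produceWithLocalMap(const std::vector<RecHit>& hits) {
  // Node-based containers such as std::map / std::unordered_map allocate
  // one node per element anyway, so keeping the map as a class member does
  // not amortize those allocations across events.  A local map costs about
  // the same to fill and is freed automatically on return.
  std::unordered_map<const RecHit*, const PandoraHit*> hitIndex;
  hitIndex.reserve(hits.size());

  for (const auto& hit : hits) {
    hitIndex.emplace(&hit, nullptr);  // placeholder for the translated object
  }
  // ... use hitIndex for the rest of the translation ...
}  // hitIndex goes out of scope here; nothing is retained between events
```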

@cmsbuild (Contributor) commented Mar 2, 2015

Pull request #8015 was updated. @cmsbuild, @nclopezo can you please check and sign again.

@lgray (Contributor, Author) commented Mar 2, 2015

There is no substantial change to the runtime, only to the retained memory, so please go ahead and merge. Private tests indicate the crash is gone, or at least that we no longer see errors within 20 events at 140 pileup, where we were previously able to make the crash reproducible.

@mark-grimes

merge

cmsbuild added a commit that referenced this pull request on Mar 2, 2015: Refresh the index map of pointers to hits and tracks in Pandora translator
@cmsbuild merged commit 7119b93 into cms-sw:CMSSW_6_2_X_SLHC on Mar 2, 2015
@mark-grimes

Runtime and memory use for the first 42 events after applying this patch (the rest are still running). I guess the cuts in #8023 could speed this up.
(EDIT: this is for RelValQCDForPF_14TeV at 140 pileup.)
[plots: pandoratime, pandoramemoryuse]

@lgray (Contributor, Author) commented Mar 4, 2015

@mark-grimes RSS is never over 3 GB; is this OK for the RelVals, or will we get killed on VM size?

@lgray (Contributor, Author) commented Mar 4, 2015

@mark-grimes Speed-ups from #8023 will likely be rather random in nature.

@mark-grimes

I think so. The current plan is to submit the relvals and keep our fingers crossed; the release came out yesterday. @davidlange6 was reporting a few peaks of ~3.1 GB; my worry is that it depends on the sample used. I was using QCDForPF, which is flat, while he was using QCD 80-120, which will have more activity in the endcaps.

For the record, the full plots are below (same as above, but for the finished job). The job crashed on event 63 with an error in SoftPFMuonTagInfoProducer, which is already fixed in #8022. I also added the user time split by module for the top 10 modules, although it's not particularly informative since Pandora dominates, as expected.

Average time per event comes out at 13 +/- 4 minutes.

[plots: eventruntime, memory, moduleruntime]

@davidlange6 (Contributor)

Hi Mark,
For normalization purposes, what is the Shashlik timing at 140 PU on the same machine (ideally also without DQM, to compare apples to apples and to match what we do in production)?


@mark-grimes

I'll start a job to check; I've always run with DQM on.

@mark-grimes

First 23 Shashlik events are 2.4 +/- 0.4 minutes/event.

@lgray (Contributor, Author) commented Mar 4, 2015

If you could run the HGCal-Pandora jobs with igprof on the events that have tails out to 20 minutes or more, that would help too.

At some point the algorithms need to be refined much further. Right now the speed-ups come from patches and short circuits around the slow bits; a proper rewrite would improve things much more.

@lgray (Contributor, Author) commented Mar 4, 2015

An update here: Kevin and I are addressing some memory-locality issues (i.e. keeping workspaces in the same place and not causing allocator 'churn') that we found in the modified Pandora algorithms. This seems to improve both RSS and timing. Stay tuned; we're checking the effects on physics performance.

@pfs @vandreev11
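
As a rough illustration of the workspace idea (invented names, not the actual Pandora algorithm code): a contiguous scratch buffer that is cleared and reused keeps its storage in one place and stops the per-event allocate/free churn. Unlike the node-based map discussed above, a contiguous buffer does benefit from being kept around, because its reserved capacity is reused.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical workspace element, not an actual Pandora type.
struct ClusterCandidate { double energy; };

class AlgoWithWorkspace {
public:
  void processEvent(std::size_t nInputs) {
    // Reuse the same workspace every event: clear() keeps the capacity, so
    // after the first few events no further allocations happen and the data
    // stays at a stable memory location (better locality, no churn).
    m_scratch.clear();
    if (m_scratch.capacity() < nInputs) m_scratch.reserve(nInputs);

    for (std::size_t i = 0; i < nInputs; ++i) {
      m_scratch.push_back(ClusterCandidate{static_cast<double>(i)});
      // ... per-event work on m_scratch ...
    }
  }

private:
  std::vector<ClusterCandidate> m_scratch;  // persistent, reused workspace
};
```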

@lgray (Contributor, Author) commented Mar 4, 2015

OK, not much of a timing improvement, but the memory usage patterns are certainly more sane. Definitely helps RSS a bit too.

@mark-grimes

Although it's not really relevant to this pull request, for completeness here are the Shashlik plots.

Average time per event is 2.4 +/- 0.6 minutes.

[plots: shashlikeventruntime, shashlikmemory, shashlikmoduleruntime]
