Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing memory leak in new Tracking AlCa producer #24626

Merged
merged 1 commit into from Sep 28, 2018

Conversation

Sam-Harper
Copy link
Contributor

@Sam-Harper Sam-Harper commented Sep 22, 2018

Dear All,

The EGamma 2018 RelVal workflows have been crashing due to jobs existing their memory allocations since pre3. After investigation from the EGamma group, we have discovered that CalibrationTrackSelectorFromDetIdList (newly introduced for pre3) has a large memory leak where it leaks the hits it copies if it does not decide to save the track. For example after 1850 events for one of the problematic jobs, this module is responsible for 20% of the live memory.
http://sharper.web.cern.ch/sharper/cgi-bin/igprof-navigator/egMem1030pre3Live1851

With this PR, this consumption is reduced to a level below the measurable threshold.
http://sharper.web.cern.ch/sharper/cgi-bin/igprof-navigator/egMem1030pre3FixedLive1851

This PR fixes the memory leak in CalibrationTrackSelectorFromDetIdList by using smart rather than bare pointers. Could I request that the AlCa group encourages new code to follow this approach in the future, its good practice in modern c++ to do use smart pointers where possible and saves effort (such as this) in the long run.

More details:
https://sharper.web.cern.ch/sharper/cms/talks/egammaRelValMemLeak.pdf

On testing:
I have only tested that it fixes the memory leak, I did not look at the output of the module to ensure it was still the same. I would be surprised if it changed due to these technical changes but just to warn that bit hasnt been tested. If in few days my jobs finish successfully (ie they arent killed or anything) I can supply the relevant output root files if requested too but otherwise I dont intend to do anything more

Best,
Sam

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @Sam-Harper (Sam Harper) for master.

It involves the following packages:

Calibration/TkAlCaRecoProducers

@cmsbuild, @franzoni, @pohsun, @tocheng, @lpernie can you please review it and eventually sign? Thanks.
@mmusich, @threus, @tocheng this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@davidlange6
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 22, 2018

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/30567/console Started: 2018/09/22 16:43

@@ -108,22 +107,20 @@ void CalibrationTrackSelectorFromDetIdList::produce(edm::Event& iEvent, const ed
}

// here there will be the selection
hits.push_back(hit->clone());
hits.emplace_back(hit->clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @Sam-Harper @mmusich - this copy is not needed as the hits are again copied later in the code. perhaps just a std::vector<const TrackingRecHit*> filled with 'hit' rather than 'hit->clone()' would be a more efficient fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I'm not sure I follow. I dont see where it copies. I can see places where it looks like it copies but doesnt. ownHits is a edm::OwnVector which we pass pointers into (and their ownership) not a vector. And when we create the TrackCandidate its also not a copy, it moves the own vector into it (well swaps it). Is it due to the return of that function, I though return value optimization would take care of that although I'm far from an expert in these things.

Regardless which of us is correct, you raise a good point, we should have a vector of pointers here and only if we actually save a track should we start making copies, as its coded now its inefficient. I think I'll leave that as an exercise for @mmusich, my involvement is solely a minimal fix for the problems it was causing EGamma (well for CMS but Egamma was the canary)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder who really is aware of the four(!) different semantics of OwnVector::push_back
(I was not).
https://cmssdt.cern.ch/dxr/CMSSW/source/DataFormats/Common/interface/OwnVector.h#288
In my opinion they are just confusing and obfuscating...
OwnVector is a remnant from the time when OO was supposed to "hide complexity" behind a common "simple", "natural" interface: failed! :-(look to its iterator)
anyhow this is how history goes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed up at #24759

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-24626/30567/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 3146746
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3146548
  • DQMHistoTests: Total skipped: 197
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

@fabiocos
Copy link
Contributor

@pohsun, @tocheng, @lpernie could you please check and sign it in case or comment?

@lpernie
Copy link
Contributor

lpernie commented Sep 27, 2018

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@fabiocos
Copy link
Contributor

@Sam-Harper thanks for the fix, I will move forward with this PR so as to avoid the problem in 10_3_0_pre5, and let @mmusich in case work further on the code optimization, if deemed necessary

@fabiocos
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 286baaa into cms-sw:master Sep 28, 2018
@Sam-Harper
Copy link
Contributor Author

No problem and a sensible strategy. Thanks! Btw as an aside my test job of one problematic file ended up consuming 32 gb of memory without this fix and 11gb with.

@mmusich
Copy link
Contributor

mmusich commented Sep 28, 2018

I confirm that this PR is both:

  • fixing the issue with the memory leak (here 1000 ExpressStream events reALCARECOed):
    memory
  • doesn't change the behavior of the output tracks:
    h_tracketa
    h_tracknumberofvalidhits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants