
Decrease memory used by OscarMTProducer #26578

Merged
merged 11 commits into cms-sw:master from Dr15Jones:memoryCleanupG4 on May 3, 2019

Conversation

Dr15Jones
Contributor

PR description:

Decrease the memory held by OscarMTProducer during event processing. This is accomplished by

  • clearing the temporary memory used to process an event at the end of that event instead of at the beginning of the next event
  • clearing the G4 allocator used to store all CaloG4Hits at the end of each event, rather than letting it grow for the entire job (this hoarding was found with IgProf; see the sketch below)
  • cleaning up memory at the end of the job, thereby making it easier to use tools to find per-event memory leaks
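A rough illustration of that allocator point is sketched below. G4Allocator and its MallocSingle/FreeSingle/ResetStorage methods are standard Geant4; the hit type, allocator pointer, and helper functions are hypothetical stand-ins rather than the actual CMSSW code.

// Hedged sketch: free the per-event hits and then return the allocator's
// pooled pages, instead of letting the pool grow for the entire job.
#include <new>
#include <vector>
#include "G4Allocator.hh"
#include "tls.hh"  // defines G4ThreadLocal

struct HitSketch {  // stand-in for CaloG4Hit
  double energyDeposit = 0.;
};

G4ThreadLocal G4Allocator<HitSketch>* hitAllocator = nullptr;

HitSketch* makeHit() {
  if (hitAllocator == nullptr)
    hitAllocator = new G4Allocator<HitSketch>;
  return new (hitAllocator->MallocSingle()) HitSketch();  // construct inside the pool
}

void clearHitsAtEndOfEvent(std::vector<HitSketch*>& ownedHits) {
  for (auto* hit : ownedHits) {
    hit->~HitSketch();              // run the destructor by hand...
    hitAllocator->FreeSingle(hit);  // ...then hand the slot back to the pool
  }
  ownedHits.clear();
  // Without this the pool keeps every page it ever acquired until the end of
  // the job; resetting per event is what keeps the resident memory bounded.
  if (hitAllocator != nullptr)
    hitAllocator->ResetStorage();
}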

PR validation:

This was triggered from the production problem here: https://hypernews.cern.ch/HyperNews/CMS/get/simDevelopment/1891.html

I used the configuration from the hypernews and ran IgProfService to isolate memory leaks and per-event memory hoarding. Once all the changes were done, I ran it under CMSSW_10_6_ASAN_X_2019-04-26-2300 to double check there were no memory errors (it was clean).

Using the original code with 8 threads on only 20 events, the memory kept growing and reached RSS=3.6GB. After the code change, running with 8 threads over 135 events, the job was still around RSS=2GB after 20 events and stayed near 2.2GB until the very end of the job, when it grew temporarily to RSS=3.4GB.

Ownership of the hit collection is transferred to G4HCofThisEvent.
…tector

The numbering scheme object is a static thread_local function variable
and therefore should not be deleted explicitly.
AttachSD has no state and therefore can just be used as a
local variable.
All memory except for G4Run (which has segmentation fault problems)
is now cleaned up at the end of the job.
The reusehit container now owns the CaloG4Hits it contains and it
is now cleared at the end of processing an event.
Removed the need for the helper container selIndex.
Made hitvec a temporary variable of a member function.
Can now call ResetStorage on the G4 allocator that holds all the
memory for all CaloG4Hits in a particular thread. This will allow
clearing of the memory at the end of each event.
IgProf showed this was a major memory hoarder.
Now clear temporary memory used to process an event before leaving
OscarMTProducer::produce.
Properly cleanup the thread_local storage at the end of the job.
This makes it easier to find per event memory leaks.
The containers were supposed to be either removed or replaced.
Given there is a RunManagerMTWorker per stream, and the tls is
per thread, we need to be sure not to delete the same tls
multiple times (see the sketch below).
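A minimal sketch of that last point, with hypothetical names (the real per-thread data lives in RunManagerMTWorker and is richer than this):

// Hedged sketch: several per-stream workers can run on the same thread, but
// the thread-local data must be deleted exactly once at end of job.
struct ThreadData {
  // ... per-thread Geant4 objects owned by the worker ...
};

static thread_local ThreadData* threadData = nullptr;

void cleanUpThreadLocals() {
  if (threadData != nullptr) {
    delete threadData;     // the first worker to get here releases the data
    threadData = nullptr;  // later workers on this thread see nullptr and skip the delete
  }
}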
@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26578/9502

  • This PR adds an extra 52KB to repository

Code check has found code style and quality issues which could be resolved by applying the following patch(es)

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26578/9503

  • This PR adds an extra 52KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

SimG4CMS/Calo
SimG4CMS/Tracker
SimG4Core/Application
SimG4Core/SensitiveDetector

@cmsbuild, @civanch, @mdhildreth can you please review it and eventually sign? Thanks.
@makortel this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@civanch
Contributor

civanch commented Apr 30, 2019

please test

@cmsbuild
Contributor

cmsbuild commented Apr 30, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/34434/console Started: 2019/04/30 22:59

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

cmsbuild commented May 1, 2019

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-26578/34434/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 33
  • DQMHistoTests: Total histograms compared: 3211964
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3211759
  • DQMHistoTests: Total skipped: 204
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (32 files compared)
  • Checked 137 log files, 14 edm output root files, 33 DQM output files

@civanch
Contributor

civanch commented May 1, 2019

@Dr15Jones, great job!
The fact that there is no difference in the regression means that this PR is correct. Give me some time to understand where we lost the proper clean-up of intermediate calo hits; the problem appeared sometime after 10_2. Concerning object deletion at the end of the run: since the migration to MT we have had a G4 problem in that not all objects are properly deleted at the end of the run. Geant4 has been improving in this respect with each new release, but G4 10.4 is still not fully cleaned up. Moreover, OscarProducer should delete some G4 objects, and I think what you propose in this PR is necessary and more correct than before.

@Dr15Jones
Contributor Author

For completeness, I ran 135 events using the original code (all the numbers I've used have been from CMSSW_10_5_0, as that is where the production jobs saw the problem). In that case the job ran at around RSS > 3GB for most of the time and then at the end went up to 4GB.

@civanch
Contributor

civanch commented May 2, 2019

@Dr15Jones, I ran ttbar for 600 events with several versions of CMSSW and with this fix. For these events there is no memory growth at this statistics: I see RSS=1.76 GB before the fix and 1.66 GB after the fix, which is well explained by the clean-up at the end of the event. VSIZE changes from 3.67 GB to 3.57 GB. What I do not understand is that in CMSSW_10_1, for the same events but with the 2017 geometry, we had RSS=1.45 GB and VSIZE=2.34 GB; if I rerun with this fix and the 2017 global tag and era, I still get RSS=1.65 GB and VSIZE=3.57 GB.

In summary, this PR really fixes the G4Allocator memory and improves the deletion at the end of the run. However, I do not understand the increase in RSS and VSIZE at some point during the 10_2 developments. It may not be due to SIM itself but to another part of CMSSW; at least, I cannot say for sure.

Concerning G4Allocator: all objects backed by a G4Allocator must be deleted just after usage; in that case the allocated memory does not grow and is used efficiently. If they are not deleted, the RSS grows, but the memory is still cleaned up at the end of the run, so the leak cannot be detected in an easy way.
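For context on that last paragraph, Geant4 hit classes typically route their operator new/delete through a thread-local G4Allocator, so deleting each hit promptly just recycles its slot in the pool. A hedged sketch with a hypothetical hit class (the real CaloG4Hit has more members):

// Hedged sketch of the standard Geant4 allocator pattern: class-specific
// new/delete draw from a thread-local G4Allocator pool, so a prompt delete
// reuses the slot instead of growing the pool. MyHit is hypothetical.
#include <cstddef>
#include "G4Allocator.hh"
#include "tls.hh"  // defines G4ThreadLocal

class MyHit {
public:
  inline void* operator new(std::size_t);
  inline void operator delete(void* aHit);

private:
  double energyDeposit_ = 0.;
};

G4ThreadLocal G4Allocator<MyHit>* myHitAllocator = nullptr;

inline void* MyHit::operator new(std::size_t) {
  if (myHitAllocator == nullptr)
    myHitAllocator = new G4Allocator<MyHit>;
  return (void*)myHitAllocator->MallocSingle();
}

inline void MyHit::operator delete(void* aHit) {
  myHitAllocator->FreeSingle((MyHit*)aHit);  // slot is reused by the next hit
}

If the hits are instead kept alive for the whole event, as in this workflow, the per-event ResetStorage() call introduced by this PR plays the same role in bulk.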

@civanch
Contributor

civanch commented May 2, 2019

+1
This PR makes an important memory fix. @fabiocos, please have a look and merge it.
Some questions still remain unclear, at least to me, but they are a subject for further investigation.

@cmsbuild
Contributor

cmsbuild commented May 2, 2019

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@Dr15Jones
Contributor Author

@civanch what I found helpful for tracking down why the 'live' memory had increased was to add

process.add_(cms.Service("IgProfService", 
                         reportEventInterval = cms.untracked.int32(10),
                         reportToFileAtPostEvent = cms.untracked.string("igtest_service_3.%E.out")))

to the configuration and then run

igprof -d -mp -z -o igprof_<something unique>.mp.gz -t cmsRunGlibC cmsRunGlibC <test_cfg.py>

and then compare two different outputs from the same job

igprof-analyse -d -v -g -r MEM_LIVE --baseline igtest_service_3.1.out --diff-mode igtest_service_3.11.out

You could do something similar between two different jobs. Use the IgProfService to 'snapshot' the live memory of both jobs at some point in time (say after 10 events) and then use igprof-analyse with --diff-mode to see what memory is different between the two jobs.

@fabiocos
Contributor

fabiocos commented May 2, 2019

@civanch did you perhaps compare with the same 600 events without this PR? Here we see identical results, but usually only a few tens of events are run (although if something were really wrong it would likely still show up).

@fabiocos
Contributor

fabiocos commented May 3, 2019

I've run 100 TTbar events in the latest IB with and without this PR, finding a set of differences that are in any case minor. What surprises me is that most of the G4 hit-level plots are fine, but I see a couple of differences at GEN level (TTbar analysis, one particle of difference), and these are likely the cause of all the subsequent observed differences. I therefore conclude that those differences should not be mainly linked to this PR (although I'll do an extra check, since GEN should be reproducible).

@fabiocos
Contributor

fabiocos commented May 3, 2019

+1

@cmsbuild cmsbuild merged commit b950223 into cms-sw:master May 3, 2019
@Dr15Jones Dr15Jones deleted the memoryCleanupG4 branch May 10, 2019 14:43