
Decrease memory used by OscarMTProducer #26578

Merged
merged 11 commits into cms-sw:master from Dr15Jones:memoryCleanupG4 on May 3, 2019

Conversation

Dr15Jones
Contributor

PR description:

Decrease the memory held by OscarMTProducer during event processing. This is accomplished by

  • clearing the temporary memory used to process an event at the end of that event instead of at the beginning of the next event
  • clearing the G4 allocator used to store all CaloG4Hits at the end of each event, rather than letting it grow for the entire job (this hoarding was found with IgProf; see the sketch below)
  • cleaning up memory at the end of the job, thereby making it easier to use tools to find per-event memory leaks
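A rough illustration of that allocator point is sketched below. G4Allocator and its MallocSingle/FreeSingle/ResetStorage methods are standard Geant4; the hit type, allocator pointer, and helper functions are hypothetical stand-ins rather than the actual CMSSW code.

// Hedged sketch: free the per-event hits and then return the allocator's
// pooled pages, instead of letting the pool grow for the entire job.
#include <new>
#include <vector>
#include "G4Allocator.hh"
#include "tls.hh"  // defines G4ThreadLocal

struct HitSketch {  // stand-in for CaloG4Hit
  double energyDeposit = 0.;
};

G4ThreadLocal G4Allocator<HitSketch>* hitAllocator = nullptr;

HitSketch* makeHit() {
  if (hitAllocator == nullptr)
    hitAllocator = new G4Allocator<HitSketch>;
  return new (hitAllocator->MallocSingle()) HitSketch();  // construct inside the pool
}

void clearHitsAtEndOfEvent(std::vector<HitSketch*>& ownedHits) {
  for (auto* hit : ownedHits) {
    hit->~HitSketch();              // run the destructor by hand...
    hitAllocator->FreeSingle(hit);  // ...then hand the slot back to the pool
  }
  ownedHits.clear();
  // Without this the pool keeps every page it ever acquired until the end of
  // the job; resetting per event is what keeps the resident memory bounded.
  if (hitAllocator != nullptr)
    hitAllocator->ResetStorage();
}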

PR validation:

This was triggered from the production problem here: https://hypernews.cern.ch/HyperNews/CMS/get/simDevelopment/1891.html

I used the configuration from the hypernews and ran IgProfService to isolate memory leaks and per-event memory hoarding. Once all the changes were done, I ran it under CMSSW_10_6_ASAN_X_2019-04-26-2300 to double check there were no memory errors (it was clean).

Using the original code with 8 threads on only 20 events, the memory kept growing and reached RSS=3.6GB. After the code change, running with 8 threads over 135 events, the job was still around RSS=2GB after 20 events and stayed near 2.2GB until the very end of the job, when it grew temporarily to RSS=3.4GB.

Ownership of the hit collection is transferred to G4HCofThisEvent.
…tector

The numbering scheme object is a static thread_local function variable
and therefore should not be deleted explicitly.
AttachSD has no state and therefore can just be used as a
local variable.
All memory except for G4Run (which has segmentation fault problems)
is now cleaned up at the end of the job.
The reusehit container now owns the CaloG4Hits it contains and it
is now cleared at the end of processing an event.
Removed the need for the helper container selIndex.
Made hitvec a temporary variable of a member function.
Can now call ResetStorage on the G4 allocator that holds all the
memory for all CaloG4Hits in a particular thread. This will allow
clearing of the memory at the end of each event.
IgProf showed this was a major memory hoarder.
Now clear temporary memory used to process an event before leaving
OscarMTProducer::produce.
Properly cleanup the thread_local storage at the end of the job.
This makes it easier to find per event memory leaks.
The containers were supposed to be either removed or replaced.
Given there is a RunManagerMTWorker per stream, and the tls is
per thread, we need to be sure not to delete the same tls
multiple times (see the sketch below).
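A minimal sketch of that last point, with hypothetical names (the real per-thread data lives in RunManagerMTWorker and is richer than this):

// Hedged sketch: several per-stream workers can run on the same thread, but
// the thread-local data must be deleted exactly once at end of job.
struct ThreadData {
  // ... per-thread Geant4 objects owned by the worker ...
};

static thread_local ThreadData* threadData = nullptr;

void cleanUpThreadLocals() {
  if (threadData != nullptr) {
    delete threadData;     // the first worker to get here releases the data
    threadData = nullptr;  // later workers on this thread see nullptr and skip the delete
  }
}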
@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26578/9502

  • This PR adds an extra 52KB to repository

Code check has found code style and quality issues which could be resolved by applying the following patch(es)

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26578/9503

  • This PR adds an extra 52KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

SimG4CMS/Calo
SimG4CMS/Tracker
SimG4Core/Application
SimG4Core/SensitiveDetector

@cmsbuild, @civanch, @mdhildreth can you please review it and eventually sign? Thanks.
@makortel this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@civanch
Contributor

civanch commented Apr 30, 2019

please test

@cmsbuild
Contributor

cmsbuild commented Apr 30, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/34434/console Started: 2019/04/30 22:59

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

cmsbuild commented May 1, 2019

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-26578/34434/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 33
  • DQMHistoTests: Total histograms compared: 3211964
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3211759
  • DQMHistoTests: Total skipped: 204
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (32 files compared)
  • Checked 137 log files, 14 edm output root files, 33 DQM output files

@civanch
Contributor

civanch commented May 1, 2019

@Dr15Jones, great job!
The fact that there is no difference in the regression means that this PR is correct. Give me some time to understand where we lost the proper clean-up of intermediate calo hits; the problem appeared sometime after 10_2. Concerning object deletion at the end of the run: since the migration to MT we have had a G4 problem in that not all objects are properly deleted at the end of the run. Geant4 has been improving in this respect with each new release, but G4 10.4 is still not fully cleaned up. Moreover, OscarProducer should delete some G4 objects, and I think what you propose in this PR is necessary and more correct than before.

@Dr15Jones
Contributor Author

For completeness, I ran 135 events using the original code (all the numbers I've used have been from CMSSW_10_5_0, as that is where the production jobs saw the problem). In that case the job ran at around RSS > 3GB for most of the time and then at the end went up to 4GB.

@civanch
Contributor

civanch commented May 2, 2019

@Dr15Jones, I ran ttbar for 600 events with several versions of CMSSW and with this fix. For these events there is no memory growth at this statistics: I see RSS=1.76 GB before the fix and 1.66 GB after the fix, which is well explained by the clean-up at the end of the event. VSIZE changes from 3.67 GB to 3.57 GB. What I do not understand is that in CMSSW_10_1, for the same events but with the 2017 geometry, we had RSS=1.45 GB and VSIZE=2.34 GB; if I rerun with this fix and the 2017 global tag and era, I still get RSS=1.65 GB and VSIZE=3.57 GB.

In summary, this PR really fixes the G4Allocator memory and improves the deletion at the end of the run. However, I do not understand the increase in RSS and VSIZE at some point during the 10_2 developments. It may not be due to SIM itself but to another part of CMSSW; at least, I cannot say for sure.

Concerning G4Allocator: all objects backed by a G4Allocator must be deleted just after usage; in that case the allocated memory does not grow and is used efficiently. If they are not deleted, the RSS grows, but the memory is still cleaned up at the end of the run, so the leak cannot be detected in an easy way.
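For context on that last paragraph, Geant4 hit classes typically route their operator new/delete through a thread-local G4Allocator, so deleting each hit promptly just recycles its slot in the pool. A hedged sketch with a hypothetical hit class (the real CaloG4Hit has more members):

// Hedged sketch of the standard Geant4 allocator pattern: class-specific
// new/delete draw from a thread-local G4Allocator pool, so a prompt delete
// reuses the slot instead of growing the pool. MyHit is hypothetical.
#include <cstddef>
#include "G4Allocator.hh"
#include "tls.hh"  // defines G4ThreadLocal

class MyHit {
public:
  inline void* operator new(std::size_t);
  inline void operator delete(void* aHit);

private:
  double energyDeposit_ = 0.;
};

G4ThreadLocal G4Allocator<MyHit>* myHitAllocator = nullptr;

inline void* MyHit::operator new(std::size_t) {
  if (myHitAllocator == nullptr)
    myHitAllocator = new G4Allocator<MyHit>;
  return (void*)myHitAllocator->MallocSingle();
}

inline void MyHit::operator delete(void* aHit) {
  myHitAllocator->FreeSingle((MyHit*)aHit);  // slot is reused by the next hit
}

If the hits are instead kept alive for the whole event, as in this workflow, the per-event ResetStorage() call introduced by this PR plays the same role in bulk.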

@civanch
Contributor

civanch commented May 2, 2019

+1
This PR makes an important memory fix. @fabiocos, please have a look and merge it.
Some questions still remain unclear, at least to me, but they are a subject for further investigation.

@cmsbuild
Contributor

cmsbuild commented May 2, 2019

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@Dr15Jones
Contributor Author

@civanch what I found helpful for tracking down why the 'live' memory had increased was to add

process.add_(cms.Service("IgProfService", 
                         reportEventInterval = cms.untracked.int32(10),
                         reportToFileAtPostEvent = cms.untracked.string("igtest_service_3.%E.out")))

to the configuration and then run

igprof -d -mp -z -o igprof_<something unique>.mp.gz -t cmsRunGlibC cmsRunGlibC <test_cfg.py>

and then compare two different outputs from the same job

igprof-analyse -d -v -g -r MEM_LIVE --baseline igtest_service_3.1.out --diff-mode igtest_service_3.11.out

You could do something similar between two different jobs. Use the IgProfService to 'snapshot' the live memory of both jobs at some point in time (say after 10 events) and then use igprof-analyse with --diff-mode to see what memory is different between the two jobs.

@fabiocos
Contributor

fabiocos commented May 2, 2019

@civanch did you perhaps compare with the same 600 events without this PR? Here we see identical results, but usually only a few tens of events are run (although if something were really wrong it would likely still show up).

@fabiocos
Contributor

fabiocos commented May 3, 2019

I've run 100 TTbar events in the latest IB with and without this PR, finding a set of differences that are in any case minor. What surprises me is that most of the G4 hit-level plots are fine, but I see a couple of differences at GEN level (TTbar analysis, one particle of difference), and these are likely the cause of all the subsequent observed differences. I therefore conclude that those differences should not be mainly linked to this PR (although I'll do an extra check, since GEN should be reproducible).

@fabiocos
Contributor

fabiocos commented May 3, 2019

+1

@cmsbuild cmsbuild merged commit b950223 into cms-sw:master May 3, 2019
@Dr15Jones Dr15Jones deleted the memoryCleanupG4 branch May 10, 2019 14:43