Decrease memory used by OscarMTProducer #26578
Conversation
Ownership of the hit collection is transferred to G4HCofThisEvent.
…tector The numbering scheme object is a static thread_local function variable and therefore should not be deleted explicitly.
AttachSD has no state and therefore can simply be used as a local variable.
All memory except for G4Run (which has segmentation fault problems) is now cleaned up at the end of the job.
The reusehit container now owns the CaloG4Hits it contains, and it is now cleared at the end of processing each event. Removed the need for the helper container selIndex. Made hitvec a temporary variable of a member function.
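The ownership change described above can be sketched as follows. This is only an illustration of the pattern, not the actual CMSSW code: `Hit` stands in for CaloG4Hit, and `HitCache` is a hypothetical name for the class holding the reusehit container.

```cpp
#include <memory>
#include <vector>

// Stand-in for CaloG4Hit; the real class lives in SimG4CMS/Calo.
struct Hit {
  double energy = 0.0;
};

class HitCache {
public:
  // The cache owns the hits it holds; unique_ptr makes that explicit.
  void store(std::unique_ptr<Hit> hit) { reusehit_.push_back(std::move(hit)); }

  std::size_t size() const { return reusehit_.size(); }

  // Called at the end of each event: destroys the owned hits so the
  // per-event memory does not accumulate across the job.
  void endOfEvent() { reusehit_.clear(); }

private:
  std::vector<std::unique_ptr<Hit>> reusehit_;
};
```

Making the container hold `std::unique_ptr` rather than raw pointers is what removes the need for a parallel bookkeeping container: clearing the vector is now sufficient to release everything it holds.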
Can now call ResetStorage on the G4 allocator that holds all the memory for all CaloG4Hits in a particular thread. This will allow clearing of the memory at the end of each event. IgProf showed this was a major memory hoarder.
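A minimal sketch of the idea, with a toy pool standing in for Geant4's `G4Allocator<CaloG4Hit>` (`ResetStorage()` is the name of the real G4Allocator method; everything else here, including the class and member names, is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Toy arena standing in for G4Allocator<CaloG4Hit>. The real allocator
// hands out hits from pooled pages and keeps those pages for reuse;
// ResetStorage() returns all of them at once.
class HitPool {
public:
  void* allocate(std::size_t n) {
    pages_.emplace_back(n);  // one "page" per allocation, for simplicity
    bytesInUse_ += n;
    return pages_.back().data();
  }

  std::size_t bytesInUse() const { return bytesInUse_; }

  // Analogue of G4Allocator::ResetStorage(): drop every page in one call.
  // This is only safe once no live object still points into the pool,
  // which is why the PR calls it at the end of each event.
  void resetStorage() {
    pages_.clear();
    bytesInUse_ = 0;
  }

private:
  std::vector<std::vector<std::byte>> pages_;
  std::size_t bytesInUse_ = 0;
};
```

The point of the per-event reset is that pool memory which would otherwise only be released at the end of the job (and so look like hoarding in IgProf) is instead recycled every event.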
Now clear the temporary memory used to process an event before leaving OscarMTProducer::produce. Properly clean up the thread_local storage at the end of the job. This makes it easier to find per-event memory leaks.
The containers were supposed to be either removed or replaced.
Given that there is one RunManagerMTWorker per stream while the thread_local storage is per thread, we need to be sure not to delete the same thread_local storage multiple times.
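One common way to make per-thread teardown safe when several workers can share a thread is to make the cleanup idempotent: the pointer is nulled on first deletion, so later callers on the same thread are no-ops. This is only an illustrative pattern, not the actual RunManagerMTWorker code; the names here are hypothetical.

```cpp
#include <string>

namespace {
  // Per-thread resource shared by every worker that runs on this thread.
  // std::string stands in for the real Geant4 per-thread state.
  thread_local std::string* tlsResource = nullptr;
}

// Idempotent cleanup: the first worker to call this on a thread deletes
// the resource; any later worker on the same thread sees nullptr and
// does nothing, avoiding a double delete.
bool cleanupTls() {
  if (tlsResource == nullptr) {
    return false;  // already cleaned up (or never created) on this thread
  }
  delete tlsResource;
  tlsResource = nullptr;
  return true;
}
```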
The code-checks are being triggered in jenkins.
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26578/9502
Code checks found code style and quality issues which could be resolved by applying the following patch(es).
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26578/9503
A new Pull Request was created by @Dr15Jones (Chris Jones) for master. It involves the following packages: SimG4CMS/Calo. @cmsbuild, @civanch, @mdhildreth can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
please test
The tests are being triggered in jenkins.
Comparison job queued.
Comparison is ready. Comparison Summary:
@Dr15Jones, great job!
For completion, I ran 135 events using the original code (all the numbers I've used are from CMSSW_10_5_0, as that is where the production jobs saw the problem). In that case the job ran at around RSS > 3 GB for most of the run and then at the end went up to 4 GB.
@Dr15Jones, I ran 600 ttbar events with several versions of CMSSW and with this fix. With these statistics there is no memory growth. I see RSS = 1.76 GB before the fix and 1.66 GB after it, which is well explained by the clean-up at the end of each event. VSIZE changed from 3.67 GB to 3.57 GB. What I do not understand: in CMSSW_10_1, for the same events but with the 2017 geometry, we had RSS = 1.45 GB and VSIZE = 2.34 GB. If I rerun with this fix and the 2017 global tag and era, I still get RSS = 1.65 GB and VSIZE = 3.57 GB. In summary, this PR really does fix the G4Allocator memory and improves deletion at the end of the run. However, I do not understand the increase in RSS and VSIZE at some point during the 10_2 development cycle. It may have happened not in SIM itself but in another part of CMSSW; at least, I cannot say for sure. Concerning G4Allocator: all objects managed by G4Allocator must be deleted just after use; in that case the allocated memory will not grow and will be reused efficiently. If they are not deleted, RSS will grow, but the memory will still be cleaned up at the end of the run, so the memory leak cannot be detected in an easy way.
+1
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)
@civanch what I found helpful in finding why 'live' memory had increased was to add
to the configuration and then run
and then compare two different outputs from the same job
You could do something similar between two different jobs. Use the IgProfService to 'snapshot' the live memory of both jobs at some point in time (say after 10 events) and then use
@civanch did you perhaps compare with the same 600 events without this PR? Here we see identical results, but usually only a few tens of events are run (although if something were really wrong it would likely still appear).
I've run 100 TTbar events in the latest IB with and without this PR, finding a set of differences that are in any case minor. What surprises me is that most of the G4 hit-level plots are fine, but I see a couple of differences at GEN level (TTbar analysis, one particle of difference), and these are likely the cause of all the subsequent observed differences. Therefore I conclude that those differences should not be mainly linked to this PR (although I'll do an extra check, since GEN should be reproducible).
+1 |
PR description:
Decrease the memory held by OscarMTProducer during event processing. This is accomplished by the changes described in the individual commit messages above.
PR validation:
This was triggered from the production problem here: https://hypernews.cern.ch/HyperNews/CMS/get/simDevelopment/1891.html
I used the configuration from the hypernews and ran IgProfService to isolate memory leaks and per-event memory hoarding. Once all the changes were done, I ran it under CMSSW_10_6_ASAN_X_2019-04-26-2300 to double check there were no memory errors (it was clean).
Using the original code, running with 8 threads on only 20 events, the memory kept growing and reached RSS = 3.6 GB. After the code change, running with 8 threads over 135 events, the job was still around RSS = 2 GB after 20 events and stayed near 2.2 GB until the very end of the job, when it grew temporarily to RSS = 3.4 GB.