Clear the data buffer after constructing the Objects in a collection #129

tmadlener · 2020-08-27T17:45:22Z

BEGINRELEASENOTES

Reduce memory footprint by clearing intermediately used I/O buffers.

ENDRELEASENOTES

Since the data are copied into the Objs from the m_data buffer that is
filled by I/O there is no use to keep the second copy of the data around
any longer.

It would probably be possible to slightly change the layout of the Obj classes from using a Data data member (as value) to a Data* data member which could then point directly into the I/O buffer of the collections. We would then not need the copy of the data that is currently done in prepareAfterRead. Keeping track of the data would then be a bit more work as it is at the moment, but to the actual user classes this should be completely transparent. However, I am not actually sure how this would impact performance. Both for I/O and for later usage.

In any case this small fix here at least cleans up the I/O buffer after all its contents have been copied into the actual Objs, which should improve the memory footprint. The factor by which it is improved depends on how many different relations a given datatype has.

Since the data are copied into the Objs from the m_data buffer that is filled by I/O there is no use to keep the second copy of the data around any longer.

hegner · 2020-08-27T18:52:45Z

Hi Thomas!
Thanks. There are two reasons for the copy. One is that you do not want to have that additional level of pointer indirection every time you want to access the data (data locality). And the other one is to keep the I/O buffers separated from the event data. That is important for multithreading as the I/O buffers may be used again while an event is being processed.

tmadlener · 2020-08-28T08:16:28Z

Hi Benedikt,

One is that you do not want to have that additional level of pointer indirection every time you want to access the data (data locality).

Yes that makes sense. This might also be a bit use case dependent, but it would be interesting to see whether having that additional pointer indirection could be worth the fact that we could then store all the Data blobs contiguously in memory. E.g. for cases where we only want to loop over a given collection and are only interested in the member data but not the relations.

In the current implementation the actually used objects are stored in a std::deque<Obj*> and to me it is not entirely obvious how contiguous the actual Obj are in memory in the end, since they are created on the heap in prepareAfterRead. From a brief look with gdb it seems like they are pretty much contiguous (at least in the tests), but as far as I know there is no guarantee that this always happens and in principle it is possible that the Objs (and the Data therein) are scattered around all over in memory, while only the Obj* in the std::deque are "more or less" contiguously stored. In that sense it could be beneficial to have all the Data blobs contiguous in one buffer.

In the end we would a "realistic" benchmark case (or better yet, multiple ones), to see if the potential (but maybe unlikely?) scattering of the Obj in memory implies a higher cost than an additional pointer indirection.

And the other one is to keep the I/O buffers separated from the event data. That is important for multithreading as the I/O buffers may be used again while an event is being processed.

Good point. However, as far as I can see this is not done in the current ROOTReader implementation, right? That will always allocate a new Collection and then associate the I/O buffer with the corresponding data branch in the input file and then hand off ownership to the EventStore that requested it. On the ROOTWriter side it is true that we currently re-use the I/O buffers for each event, but also that should be manageable in principle, since we could reset the buffer address before writing of each event. From the looks of it, this is what is currently done in K4FWCore.

hegner · 2020-08-28T08:29:54Z

Yes, benchmarks would be good.
What concerns the multithreading - indeed that is not existing, but I made the overall layout with a particular design in mind. As I was absent from the project for more than a year so I could be a bit out of date of what is still left.
In any case - we should move that to a dedicated thread.

Clear the data buffer after constructing the Objects in a collection

f7271af

Since the data are copied into the Objs from the m_data buffer that is filled by I/O there is no use to keep the second copy of the data around any longer.

hegner merged commit 6a1e506 into AIDASoft:master Aug 28, 2020

tmadlener deleted the clear-io-buffer branch September 7, 2020 15:04

tmadlener mentioned this pull request Sep 24, 2020

cannot write out event data previously read from file #103

Closed

tmadlener mentioned this pull request Nov 12, 2020

Idea: True reference collections #146

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clear the data buffer after constructing the Objects in a collection #129

Clear the data buffer after constructing the Objects in a collection #129

tmadlener commented Aug 27, 2020

hegner commented Aug 27, 2020

tmadlener commented Aug 28, 2020

hegner commented Aug 28, 2020

Clear the data buffer after constructing the Objects in a collection #129

Clear the data buffer after constructing the Objects in a collection #129

Conversation

tmadlener commented Aug 27, 2020

hegner commented Aug 27, 2020

tmadlener commented Aug 28, 2020

hegner commented Aug 28, 2020