New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FlatGeobuf: decrease memory usage when inserting lots of features #6804
Conversation
should help a bit with https://lists.osgeo.org/pipermail/gdal-dev/2022-November/056534.html Also speed-up things a bit by avoiding use of shared_ptr<>: from 26 s to 23 s when ogr2ogr from/to FlatGeoBuf with a file of 1.8 GB and 3.3 million features.
Seems sane but it's also performance details of C++ that is beyond me. That said, the changes to packedrtree.h should be upstreamed. |
shared_ptr<> involves a memory allocation for the shared_ptr infrastructure: a pointer and a reference counter + the memory allocation for the item being pointed too. And shared_ptr<> also uses atomic counters, which use (light-weight) locking primitives of the CPU, but still when you do millions of them, this can add up
I let you do that ? |
I ponder how that works in rust land... Agreed, I will bring it upstream no problem. 🙂 |
@@ -97,7 +97,7 @@ class OGRFlatGeobufLayer final : public OGRLayer, public OGRFlatGeobufBaseLayerI | |||
|
|||
// creation | |||
bool m_create = false; | |||
std::vector<std::shared_ptr<FlatGeobuf::Item>> m_featureItems; // feature item description used to create spatial index | |||
std::vector<FeatureItem> m_featureItems; // feature item description used to create spatial index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this vector used? Are these side effects intented?
- Adjacent items (wrt index in vector) will now be adjacent in memory. This can significantly improve certain access patterns. But
- Each slot in the vector now occupies the size of
FeatureItem
instead of the size ofstd::shared_ptr
. - Reallocations while appending to the vector will now copy
FeatureItem
s instead of smart pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your points are valid. "Ideally" we'd probably want a some_container<std::vector<FeatureItem>>
where the inner std::vector<FeatureItem>
is limited to batches of let's say 1 million features to avoid too big memory reallocations, but that would be an implementation complication I'm not motivated enough to deal with.
In practice my change seems to work reasonably well, reducing memory usage and execution time, for the use case I tested (3.3 million features). It is hard to predict which of the current/old strategy or the one of this PR would be better when dealing with much more features (100's of millions or more)
Before
$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb
Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb"
User time (seconds): 23.54
System time (seconds): 3.25
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.90
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 489668
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 11
Minor (reclaiming a frame) page faults: 158929
Voluntary context switches: 215
Involuntary context switches: 679
Swaps: 0
File system inputs: 1456
File system outputs: 6857352
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Extract of cat /proc/$(pidof ogr2ogr)/status
just before process terminates:
VmPeak: 607968 kB
VmSize: 580204 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 488668 kB
VmRSS: 459888 kB
RssAnon: 428536 kB
RssFile: 31352 kB
RssShmem: 0 kB
VmData: 441776 kB
VmStk: 136 kB
VmExe: 12 kB
VmLib: 65528 kB
VmPTE: 1080 kB
VmSwap: 0 kB
Output of massif-visualizer on the file generated by valgrind --tool=massif ogr2ogr out.fgb nz-building-outlines.fgb
:
After
$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb
Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb"
User time (seconds): 19.09
System time (seconds): 3.01
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 359356
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 167308
Voluntary context switches: 185
Involuntary context switches: 112
Swaps: 0
File system inputs: 0
File system outputs: 6857352
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Extract of cat /proc/$(pidof ogr2ogr)/status
just before process terminates:
VmPeak: 511888 kB
VmSize: 483856 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 359164 kB
VmRSS: 330168 kB
RssAnon: 298068 kB
RssFile: 32100 kB
RssShmem: 0 kB
VmData: 345440 kB
VmStk: 136 kB
VmExe: 12 kB
VmLib: 65516 kB
VmPTE: 828 kB
VmSwap: 0 kB
Output of massif-visualizer on the file generated by valgrind --tool=massif ogr2ogr out.fgb nz-building-outlines.fgb
:
I'm not clear why there are phases where memory usage decrease. Perhaps during the std::vector<> reallocation where you allocate 2 * old_size before releasing old_size. But one should also see that in the Before, perhaps to a lesser extent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Measuring is great and beats all theory. : +1:
Just for completeness, std::unique_ptr
would give the smart pointer behaviour without the specific overheads of shared_ptr
. And make_shared
would improve allocations of std::shared.
"Ideally" we'd probably want a some_container<std::vector> where the inner std::vector<> is limited to batches of let's say 1 million features to avoid too big memory reallocations.
std::deque
might be an alternative to std::vector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_shared would improve allocations of std::shared.
This is what is used in master currently
std::deque might be an alternative to std::vector.
I've just tested it 83ec3c6 .
RAM usage decreases a bit compared to the current state of this PR. Execution time slightly increases however, but remains faster than master
$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb
Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb"
User time (seconds): 20.37
System time (seconds): 3.13
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.52
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 365436
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 112427
Voluntary context switches: 233
Involuntary context switches: 565
Swaps: 0
File system inputs: 0
File system outputs: 6857352
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
VmPeak: 472520 kB
VmSize: 444724 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 365616 kB
VmRSS: 336872 kB
RssAnon: 304476 kB
RssFile: 32396 kB
RssShmem: 0 kB
VmData: 306300 kB
VmStk: 136 kB
VmExe: 12 kB
VmLib: 65524 kB
VmPTE: 824 kB
VmSwap: 0 kB
This might be a good compromise for all use cases as it doesn't involve the occasional big reallocation pattern of std::vector when its capacity is exhausted.
@bjornharrtell what do you think ? Should we go with 83ec3c6 on top of this PR ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, RAM is precious. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
83ec3c6 merged into this PR
FeatureItem item; | ||
item.size = static_cast<uint32_t>(fbb.GetSize()); | ||
item.offset = m_writeOffset; | ||
item.nodeItem = { | ||
psEnvelope.MinX, | ||
psEnvelope.MinY, | ||
psEnvelope.MaxX, | ||
psEnvelope.MaxY, | ||
0 | ||
}; | ||
#if defined(__MINGW32__) && __GNUC__ >= 7 | ||
#pragma GCC diagnostic pop | ||
#endif | ||
m_featureItems.push_back(item); | ||
m_featureItems.emplace_back(std::move(item)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of moving an initialized item into emplace_back
, is there now way to avoid the initial item on the stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Browsing through https://stackoverflow.com/questions/1902832/resize-versus-push-back-in-stdvector-does-it-avoid-an-unnecessary-copy-assig makes me think that the following alternative would be more or less equivalent to the current state of the PR. Actually trying it I can't see any measurable timing difference:
diff --git a/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp b/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp
index 5d3fb3cd18..df3ae9a8e3 100644
--- a/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp
+++ b/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp
@@ -1871,7 +1871,8 @@ OGRErr OGRFlatGeobufLayer::ICreateFeature(OGRFeature *poNewFeature)
if (c == 0)
return CPLErrorIO("writing feature");
if (m_bCreateSpatialIndexAtClose) {
- FeatureItem item;
+ m_featureItems.resize(m_featureItems.size() + 1);
+ FeatureItem& item = m_featureItems.back();
item.size = static_cast<uint32_t>(fbb.GetSize());
item.offset = m_writeOffset;
item.nodeItem = {
@@ -1881,7 +1882,6 @@ OGRErr OGRFlatGeobufLayer::ICreateFeature(OGRFeature *poNewFeature)
psEnvelope.MaxY,
0
};
- m_featureItems.emplace_back(std::move(item));
}
m_writeOffset += c;
should help a bit with https://lists.osgeo.org/pipermail/gdal-dev/2022-November/056534.html
Also speed-up things a bit by avoiding use of shared_ptr<>: from 26 s to 23 s when ogr2ogr from/to FlatGeoBuf with a file of 1.8 GB and 3.3 million features.
CC @bjornharrtell