FlatGeobuf: decrease memory usage when inserting lots of features #6804

rouault · 2022-11-25T21:29:05Z

should help a bit with https://lists.osgeo.org/pipermail/gdal-dev/2022-November/056534.html

Also speed-up things a bit by avoiding use of shared_ptr<>: from 26 s to 23 s when ogr2ogr from/to FlatGeoBuf with a file of 1.8 GB and 3.3 million features.

CC @bjornharrtell

should help a bit with https://lists.osgeo.org/pipermail/gdal-dev/2022-November/056534.html Also speed-up things a bit by avoiding use of shared_ptr<>: from 26 s to 23 s when ogr2ogr from/to FlatGeoBuf with a file of 1.8 GB and 3.3 million features.

bjornharrtell · 2022-11-25T22:17:22Z

Seems sane but it's also performance details of C++ that is beyond me.

That said, the changes to packedrtree.h should be upstreamed.

rouault · 2022-11-25T22:21:54Z

Seems sane but it's also performance details of C++ that is beyond me.

shared_ptr<> involves a memory allocation for the shared_ptr infrastructure: a pointer and a reference counter + the memory allocation for the item being pointed too. And shared_ptr<> also uses atomic counters, which use (light-weight) locking primitives of the CPU, but still when you do millions of them, this can add up

That said, the changes to packedrtree.h should be upstreamed.

I let you do that ?

bjornharrtell · 2022-11-25T22:25:02Z

I ponder how that works in rust land...

Agreed, I will bring it upstream no problem. 🙂

dg0yt · 2022-11-26T11:31:24Z

ogr/ogrsf_frmts/flatgeobuf/ogr_flatgeobuf.h

@@ -97,7 +97,7 @@ class OGRFlatGeobufLayer final : public OGRLayer, public OGRFlatGeobufBaseLayerI

        // creation
        bool m_create = false;
-        std::vector<std::shared_ptr<FlatGeobuf::Item>> m_featureItems; // feature item description used to create spatial index
+        std::vector<FeatureItem> m_featureItems; // feature item description used to create spatial index


How is this vector used? Are these side effects intented?

Adjacent items (wrt index in vector) will now be adjacent in memory. This can significantly improve certain access patterns. But

Each slot in the vector now occupies the size of FeatureItem instead of the size of std::shared_ptr.

Reallocations while appending to the vector will now copy FeatureItems instead of smart pointers.

Your points are valid. "Ideally" we'd probably want a some_container<std::vector<FeatureItem>> where the inner std::vector<FeatureItem> is limited to batches of let's say 1 million features to avoid too big memory reallocations, but that would be an implementation complication I'm not motivated enough to deal with.
In practice my change seems to work reasonably well, reducing memory usage and execution time, for the use case I tested (3.3 million features). It is hard to predict which of the current/old strategy or the one of this PR would be better when dealing with much more features (100's of millions or more)

Before

$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb" User time (seconds): 23.54 System time (seconds): 3.25 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.90 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 489668 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 11 Minor (reclaiming a frame) page faults: 158929 Voluntary context switches: 215 Involuntary context switches: 679 Swaps: 0 File system inputs: 1456 File system outputs: 6857352 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0

Extract of cat /proc/$(pidof ogr2ogr)/status just before process terminates:

VmPeak: 607968 kB VmSize: 580204 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 488668 kB VmRSS: 459888 kB RssAnon: 428536 kB RssFile: 31352 kB RssShmem: 0 kB VmData: 441776 kB VmStk: 136 kB VmExe: 12 kB VmLib: 65528 kB VmPTE: 1080 kB VmSwap: 0 kB

Output of massif-visualizer on the file generated by valgrind --tool=massif ogr2ogr out.fgb nz-building-outlines.fgb:

After

$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb" User time (seconds): 19.09 System time (seconds): 3.01 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.12 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 359356 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 167308 Voluntary context switches: 185 Involuntary context switches: 112 Swaps: 0 File system inputs: 0 File system outputs: 6857352 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0

Extract of cat /proc/$(pidof ogr2ogr)/status just before process terminates:

VmPeak: 511888 kB VmSize: 483856 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 359164 kB VmRSS: 330168 kB RssAnon: 298068 kB RssFile: 32100 kB RssShmem: 0 kB VmData: 345440 kB VmStk: 136 kB VmExe: 12 kB VmLib: 65516 kB VmPTE: 828 kB VmSwap: 0 kB

Output of massif-visualizer on the file generated by valgrind --tool=massif ogr2ogr out.fgb nz-building-outlines.fgb:

I'm not clear why there are phases where memory usage decrease. Perhaps during the std::vector<> reallocation where you allocate 2 * old_size before releasing old_size. But one should also see that in the Before, perhaps to a lesser extent

Measuring is great and beats all theory. : +1:

Just for completeness, std::unique_ptr would give the smart pointer behaviour without the specific overheads of shared_ptr. And make_shared would improve allocations of std::shared.

"Ideally" we'd probably want a some_container<std::vector> where the inner std::vector<> is limited to batches of let's say 1 million features to avoid too big memory reallocations.

std::deque might be an alternative to std::vector.

make_shared would improve allocations of std::shared.

This is what is used in master currently

std::deque might be an alternative to std::vector.

I've just tested it 83ec3c6 .
RAM usage decreases a bit compared to the current state of this PR. Execution time slightly increases however, but remains faster than master

$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb" User time (seconds): 20.37 System time (seconds): 3.13 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.52 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 365436 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 112427 Voluntary context switches: 233 Involuntary context switches: 565 Swaps: 0 File system inputs: 0 File system outputs: 6857352 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0

VmPeak: 472520 kB VmSize: 444724 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 365616 kB VmRSS: 336872 kB RssAnon: 304476 kB RssFile: 32396 kB RssShmem: 0 kB VmData: 306300 kB VmStk: 136 kB VmExe: 12 kB VmLib: 65524 kB VmPTE: 824 kB VmSwap: 0 kB

This might be a good compromise for all use cases as it doesn't involve the occasional big reallocation pattern of std::vector when its capacity is exhausted.

@bjornharrtell what do you think ? Should we go with 83ec3c6 on top of this PR ?

Sure, RAM is precious. 🙂

83ec3c6 merged into this PR

dg0yt · 2022-11-26T11:33:28Z

ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp

+            FeatureItem item;
+            item.size = static_cast<uint32_t>(fbb.GetSize());
+            item.offset = m_writeOffset;
+            item.nodeItem = {
                psEnvelope.MinX,
                psEnvelope.MinY,
                psEnvelope.MaxX,
                psEnvelope.MaxY,
                0
            };
-#if defined(__MINGW32__) && __GNUC__ >= 7
-#pragma GCC diagnostic pop
-#endif
-            m_featureItems.push_back(item);
+            m_featureItems.emplace_back(std::move(item));


Instead of moving an initialized item into emplace_back, is there now way to avoid the initial item on the stack?

Browsing through https://stackoverflow.com/questions/1902832/resize-versus-push-back-in-stdvector-does-it-avoid-an-unnecessary-copy-assig makes me think that the following alternative would be more or less equivalent to the current state of the PR. Actually trying it I can't see any measurable timing difference:

diff --git a/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp b/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp index 5d3fb3cd18..df3ae9a8e3 100644 --- a/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp +++ b/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp @@ -1871,7 +1871,8 @@ OGRErr OGRFlatGeobufLayer::ICreateFeature(OGRFeature *poNewFeature) if (c == 0) return CPLErrorIO("writing feature"); if (m_bCreateSpatialIndexAtClose) { - FeatureItem item; + m_featureItems.resize(m_featureItems.size() + 1); + FeatureItem& item = m_featureItems.back(); item.size = static_cast<uint32_t>(fbb.GetSize()); item.offset = m_writeOffset; item.nodeItem = { @@ -1881,7 +1882,6 @@ OGRErr OGRFlatGeobufLayer::ICreateFeature(OGRFeature *poNewFeature) psEnvelope.MaxY, 0 }; - m_featureItems.emplace_back(std::move(item)); } m_writeOffset += c;

dg0yt reviewed Nov 26, 2022

View reviewed changes

FlatGeoBuf: use std::deque<>

83ec3c6

rouault merged commit b2c6894 into OSGeo:master Nov 27, 2022

bjornharrtell mentioned this pull request Dec 2, 2022

FlatGeobuf: Align with upstream #6838

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FlatGeobuf: decrease memory usage when inserting lots of features #6804

FlatGeobuf: decrease memory usage when inserting lots of features #6804

rouault commented Nov 25, 2022

bjornharrtell commented Nov 25, 2022

rouault commented Nov 25, 2022 •

edited

bjornharrtell commented Nov 25, 2022

dg0yt Nov 26, 2022

rouault Nov 26, 2022 •

edited

dg0yt Nov 26, 2022

rouault Nov 26, 2022 •

edited

bjornharrtell Nov 26, 2022

rouault Nov 26, 2022

dg0yt Nov 26, 2022

rouault Nov 26, 2022

FlatGeobuf: decrease memory usage when inserting lots of features #6804

FlatGeobuf: decrease memory usage when inserting lots of features #6804

Conversation

rouault commented Nov 25, 2022

bjornharrtell commented Nov 25, 2022

rouault commented Nov 25, 2022 • edited

bjornharrtell commented Nov 25, 2022

dg0yt Nov 26, 2022

Choose a reason for hiding this comment

rouault Nov 26, 2022 • edited

Choose a reason for hiding this comment

Before

After

dg0yt Nov 26, 2022

Choose a reason for hiding this comment

rouault Nov 26, 2022 • edited

Choose a reason for hiding this comment

bjornharrtell Nov 26, 2022

Choose a reason for hiding this comment

rouault Nov 26, 2022

Choose a reason for hiding this comment

dg0yt Nov 26, 2022

Choose a reason for hiding this comment

rouault Nov 26, 2022

Choose a reason for hiding this comment

rouault commented Nov 25, 2022 •

edited

rouault Nov 26, 2022 •

edited

rouault Nov 26, 2022 •

edited