Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FlatGeobuf: decrease memory usage when inserting lots of features #6804

Merged
merged 2 commits into from Nov 27, 2022

Conversation

rouault
Copy link
Member

@rouault rouault commented Nov 25, 2022

should help a bit with https://lists.osgeo.org/pipermail/gdal-dev/2022-November/056534.html

Also speed-up things a bit by avoiding use of shared_ptr<>: from 26 s to 23 s when ogr2ogr from/to FlatGeoBuf with a file of 1.8 GB and 3.3 million features.

CC @bjornharrtell

should help a bit with https://lists.osgeo.org/pipermail/gdal-dev/2022-November/056534.html

Also speed-up things a bit by avoiding use of shared_ptr<>: from 26 s to
23 s when ogr2ogr from/to FlatGeoBuf with a file of 1.8 GB and 3.3 million
features.
@bjornharrtell
Copy link
Contributor

Seems sane but it's also performance details of C++ that is beyond me.

That said, the changes to packedrtree.h should be upstreamed.

@rouault
Copy link
Member Author

rouault commented Nov 25, 2022

Seems sane but it's also performance details of C++ that is beyond me.

shared_ptr<> involves a memory allocation for the shared_ptr infrastructure: a pointer and a reference counter + the memory allocation for the item being pointed too. And shared_ptr<> also uses atomic counters, which use (light-weight) locking primitives of the CPU, but still when you do millions of them, this can add up

That said, the changes to packedrtree.h should be upstreamed.

I let you do that ?

@bjornharrtell
Copy link
Contributor

I ponder how that works in rust land...

Agreed, I will bring it upstream no problem. 🙂

@@ -97,7 +97,7 @@ class OGRFlatGeobufLayer final : public OGRLayer, public OGRFlatGeobufBaseLayerI

// creation
bool m_create = false;
std::vector<std::shared_ptr<FlatGeobuf::Item>> m_featureItems; // feature item description used to create spatial index
std::vector<FeatureItem> m_featureItems; // feature item description used to create spatial index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this vector used? Are these side effects intented?

  • Adjacent items (wrt index in vector) will now be adjacent in memory. This can significantly improve certain access patterns. But
  • Each slot in the vector now occupies the size of FeatureItem instead of the size of std::shared_ptr.
  • Reallocations while appending to the vector will now copy FeatureItems instead of smart pointers.

Copy link
Member Author

@rouault rouault Nov 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your points are valid. "Ideally" we'd probably want a some_container<std::vector<FeatureItem>> where the inner std::vector<FeatureItem> is limited to batches of let's say 1 million features to avoid too big memory reallocations, but that would be an implementation complication I'm not motivated enough to deal with.
In practice my change seems to work reasonably well, reducing memory usage and execution time, for the use case I tested (3.3 million features). It is hard to predict which of the current/old strategy or the one of this PR would be better when dealing with much more features (100's of millions or more)

Before

$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb
	Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb"
	User time (seconds): 23.54
	System time (seconds): 3.25
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.90
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 489668
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 11
	Minor (reclaiming a frame) page faults: 158929
	Voluntary context switches: 215
	Involuntary context switches: 679
	Swaps: 0
	File system inputs: 1456
	File system outputs: 6857352
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Extract of cat /proc/$(pidof ogr2ogr)/status just before process terminates:

VmPeak:	  607968 kB
VmSize:	  580204 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  488668 kB
VmRSS:	  459888 kB
RssAnon:	  428536 kB
RssFile:	   31352 kB
RssShmem:	       0 kB
VmData:	  441776 kB
VmStk:	     136 kB
VmExe:	      12 kB
VmLib:	   65528 kB
VmPTE:	    1080 kB
VmSwap:	       0 kB

Output of massif-visualizer on the file generated by valgrind --tool=massif ogr2ogr out.fgb nz-building-outlines.fgb:
image

After

$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb
	Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb"
	User time (seconds): 19.09
	System time (seconds): 3.01
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.12
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 359356
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 167308
	Voluntary context switches: 185
	Involuntary context switches: 112
	Swaps: 0
	File system inputs: 0
	File system outputs: 6857352
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Extract of cat /proc/$(pidof ogr2ogr)/status just before process terminates:

VmPeak:	  511888 kB
VmSize:	  483856 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  359164 kB
VmRSS:	  330168 kB
RssAnon:	  298068 kB
RssFile:	   32100 kB
RssShmem:	       0 kB
VmData:	  345440 kB
VmStk:	     136 kB
VmExe:	      12 kB
VmLib:	   65516 kB
VmPTE:	     828 kB
VmSwap:	       0 kB

Output of massif-visualizer on the file generated by valgrind --tool=massif ogr2ogr out.fgb nz-building-outlines.fgb:
image

I'm not clear why there are phases where memory usage decrease. Perhaps during the std::vector<> reallocation where you allocate 2 * old_size before releasing old_size. But one should also see that in the Before, perhaps to a lesser extent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Measuring is great and beats all theory. : +1:

Just for completeness, std::unique_ptr would give the smart pointer behaviour without the specific overheads of shared_ptr. And make_shared would improve allocations of std::shared.

"Ideally" we'd probably want a some_container<std::vector> where the inner std::vector<> is limited to batches of let's say 1 million features to avoid too big memory reallocations.

std::deque might be an alternative to std::vector.

Copy link
Member Author

@rouault rouault Nov 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_shared would improve allocations of std::shared.

This is what is used in master currently

std::deque might be an alternative to std::vector.

I've just tested it 83ec3c6 .
RAM usage decreases a bit compared to the current state of this PR. Execution time slightly increases however, but remains faster than master

$ /usr/bin/time --verbose ogr2ogr out.fgb nz-building-outlines.fgb
	Command being timed: "ogr2ogr out.fgb nz-building-outlines.fgb"
	User time (seconds): 20.37
	System time (seconds): 3.13
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.52
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 365436
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 112427
	Voluntary context switches: 233
	Involuntary context switches: 565
	Swaps: 0
	File system inputs: 0
	File system outputs: 6857352
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
VmPeak:	  472520 kB
VmSize:	  444724 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  365616 kB
VmRSS:	  336872 kB
RssAnon:	  304476 kB
RssFile:	   32396 kB
RssShmem:	       0 kB
VmData:	  306300 kB
VmStk:	     136 kB
VmExe:	      12 kB
VmLib:	   65524 kB
VmPTE:	     824 kB
VmSwap:	       0 kB

image

This might be a good compromise for all use cases as it doesn't involve the occasional big reallocation pattern of std::vector when its capacity is exhausted.

@bjornharrtell what do you think ? Should we go with 83ec3c6 on top of this PR ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, RAM is precious. 🙂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

83ec3c6 merged into this PR

Comment on lines +1874 to +1884
FeatureItem item;
item.size = static_cast<uint32_t>(fbb.GetSize());
item.offset = m_writeOffset;
item.nodeItem = {
psEnvelope.MinX,
psEnvelope.MinY,
psEnvelope.MaxX,
psEnvelope.MaxY,
0
};
#if defined(__MINGW32__) && __GNUC__ >= 7
#pragma GCC diagnostic pop
#endif
m_featureItems.push_back(item);
m_featureItems.emplace_back(std::move(item));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of moving an initialized item into emplace_back, is there now way to avoid the initial item on the stack?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Browsing through https://stackoverflow.com/questions/1902832/resize-versus-push-back-in-stdvector-does-it-avoid-an-unnecessary-copy-assig makes me think that the following alternative would be more or less equivalent to the current state of the PR. Actually trying it I can't see any measurable timing difference:

diff --git a/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp b/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp
index 5d3fb3cd18..df3ae9a8e3 100644
--- a/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp
+++ b/ogr/ogrsf_frmts/flatgeobuf/ogrflatgeobuflayer.cpp
@@ -1871,7 +1871,8 @@ OGRErr OGRFlatGeobufLayer::ICreateFeature(OGRFeature *poNewFeature)
         if (c == 0)
             return CPLErrorIO("writing feature");
         if (m_bCreateSpatialIndexAtClose) {
-            FeatureItem item;
+            m_featureItems.resize(m_featureItems.size() + 1);
+            FeatureItem& item = m_featureItems.back();
             item.size = static_cast<uint32_t>(fbb.GetSize());
             item.offset = m_writeOffset;
             item.nodeItem = {
@@ -1881,7 +1882,6 @@ OGRErr OGRFlatGeobufLayer::ICreateFeature(OGRFeature *poNewFeature)
                 psEnvelope.MaxY,
                 0
             };
-            m_featureItems.emplace_back(std::move(item));
         }
         m_writeOffset += c;
 

@rouault rouault merged commit b2c6894 into OSGeo:master Nov 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants