Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Abseil hash maps instead of std unordered containers for a huge speedup #4049

Open
lrineau opened this issue Jul 2, 2019 · 33 comments
Open
Assignees
Milestone

Comments

@lrineau
Copy link
Member

lrineau commented Jul 2, 2019

Issue Details

Inspired by one of the cppcon talks I listen to during my free time, I have quickly tried the hash maps from Abseil instead of std::unordered_map.

My try was a modification of copy_face_graph:

diff --git a/BGL/include/CGAL/boost/graph/copy_face_graph.h b/BGL/include/CGAL/boost/graph/copy_face_graph.h
index 9a2cb628825..fd0b602f284 100644
--- a/BGL/include/CGAL/boost/graph/copy_face_graph.h
+++ b/BGL/include/CGAL/boost/graph/copy_face_graph.h
@@ -35,6 +35,7 @@
 #include <boost/unordered_map.hpp>
 #include <boost/utility/enable_if.hpp>
 #include <boost/function_output_iterator.hpp>
+#include <absl/container/flat_hash_map.h>
 
 namespace CGAL {
 
@@ -189,7 +190,7 @@ void copy_face_graph(const SourceMesh& sm, TargetMesh& tm,
   typedef typename boost::graph_traits<SourceMesh>::halfedge_descriptor sm_halfedge_descriptor;
   typedef typename boost::graph_traits<TargetMesh>::halfedge_descriptor tm_halfedge_descriptor;
 
-  boost::unordered_map<sm_halfedge_descriptor,
+  absl::flat_hash_map<sm_halfedge_descriptor,
                        tm_halfedge_descriptor> hash_map(num_halfedges(sm));
   copy_face_graph_impl(sm, tm,
                        boost::make_assoc_property_map(hash_map),

The results are stuning: now 3.2 seconds instead of 7.1 seconds

Source Code

The .cpp code: https://gist.github.com/2e9ea602a7b1e46cd34773611acac3c6

Here is the local patch to CMakeLists.txt (a non-clean hack):

diff --git a/BGL/examples/BGL_polyhedron_3/CMakeLists.txt b/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
index ac3bd5256f2..a821195ff9c 100644
--- a/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
+++ b/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
@@ -67,6 +67,11 @@ create_single_source_cgal_program( "transform_iterator.cpp" )
 
 create_single_source_cgal_program( "copy_polyhedron.cpp" )
 
+add_subdirectory(googletest)
+add_subdirectory(abseil-cpp)
+
+target_link_libraries(copy_polyhedron PRIVATE absl::flat_hash_map)
+
 if(OpenMesh_FOUND)
   target_link_libraries( copy_polyhedron PRIVATE ${OPENMESH_LIBRARIES} )
 endif()

I had to checkout abseil and gtest:

$ git clone git@github.com:abseil/abseil-cpp.git
$ git clone git@github.com:google/googletest.git

I ran that modified copy_polyhedron.cpp on demo/Polyhedron/data/man.off, and as I said the speedup is huge: 3.2s after the patch to <CGAL/boost/graph/copy_face_graph.h>, instead of 7.1s.

Environment

  • Operating system (Windows/Mac/Linux, 32/64 bits):
  • Compiler: g++ 9.1.1 with -O3
  • Release or debug mode: Release
  • CGAL version: CGAL-5.0-dev, master version 44e4d15
  • Boost version: 1.69.0
  • Other libraries versions if used (Eigen, TBB, etc.):
@lrineau
Copy link
Member Author

lrineau commented Jul 2, 2019

https://github.com/abseil/abseil-cpp license is Apache License 2.0. It is very permissive.

@MaelRL
Copy link
Member

MaelRL commented Jul 2, 2019

https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ had some rather detailed comparisons recently, which seem to yield similar observations.

@sloriot
Copy link
Member

sloriot commented Jul 4, 2019

Shall @maxGimeno propose a global plan with the introduction of CGAL::unordered_map + the cmake machinery to get it everywhere?

@mglisse
Copy link
Member

mglisse commented Jul 4, 2019

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working 😞
But yes, anything that spends a lot of time in hashtable operations should be benchmarked with other implementations.
Note that in one extreme case (unordered_set was taking > 64G in this application) I had better results with the standard container than with abseil.

@lrineau
Copy link
Member Author

lrineau commented Jul 4, 2019

@mglisse commented on Jul 4, 2019, 9:14 AM GMT+2:

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working 😞

Hopefully CGAL uses CMake.

add_subdirectory(googletest)
add_subdirectory(abseil-cpp)

target_link_libraries(copy_polyhedron PRIVATE absl::flat_hash_map)

But yes, anything that spends a lot of time in hashtable operations should be benchmarked with other implementations.
Note that in one extreme case (unordered_set was taking > 64G in this application) I had better results with the standard container than with abseil.

Our opting for Abseil will be conditional to a macro. With it we will easily be able to build benchmarks. Even users should be made aware of that, so that they can tune.

@mglisse
Copy link
Member

mglisse commented Jul 4, 2019

Hopefully CGAL uses CMake.

As a header-only library, it doesn't matter what CGAL uses (if it uses anything at all). Providing CGALConfig.cmake does help. But users of CGAL don't all use cmake.

@lrineau
Copy link
Member Author

lrineau commented Jul 4, 2019

@mglisse commented on Jul 4, 2019, 9:38 AM GMT+2:

Hopefully CGAL uses CMake.

As a header-only library, it doesn't matter what CGAL uses (if it uses anything at all). Providing CGALConfig.cmake does help. But users of CGAL don't all use cmake.

We do not care. Abseil will be an optional dependency of CGAL, and our CMake users will get it almost for free. As for the others... they can do experiments and benchmarks using CMake, and them make the effort in their build system if the speedup is compelling.

Actually, I do not get your point. Do you have a suggestion about an alternative?

@afabri
Copy link
Member

afabri commented Jul 4, 2019

Just my two cents, it would be good to see a higher level CGAL algorithm than copy_face_graph() as a motivation.

@lrineau
Copy link
Member Author

lrineau commented Jul 4, 2019

@mglisse commented on Jul 4, 2019, 9:14 AM GMT+2:

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working 😞

Actually, can you elaborate? Because on my machine abseil seems to require only the pthread support, and no other dependency.

@maxGimeno
Copy link
Contributor

@afabri it concerns everything related to unordered maps, not only copy_face_graph().

@afabri
Copy link
Member

afabri commented Jul 4, 2019

The questions is if it has measurable impact on Mesh_3, Constrained_triangulation_2, Arrangement_2......

@maxGimeno
Copy link
Contributor

We will probably have to bench to be sure. But I have seen traces of unordered maps at least in Mesh_3 and Arrangements_on_surface_2, so there will probbaly be an impact.

@lrineau
Copy link
Member Author

lrineau commented Jul 4, 2019

For at least one user of us, CGAL::PMP::polygon_soup_to_polygon_mesh was a big consumer of time of their process. And it is all about maps.

@mglisse
Copy link
Member

mglisse commented Jul 4, 2019

Actually, can you elaborate? Because on my machine abseil seems to require only the pthread support, and no other dependency.

I didn't mean external dependencies. Using a trivial program:

#include <absl/container/flat_hash_set.h>
typedef absl::flat_hash_set<int> S;
int main(){ S s; }

Complete the command line with a minimal number of -l flags:
g++ test.cc -I .../include -L .../lib
Here I need -labsl_hashtablez_sampler -labsl_synchronization -labsl_time -labsl_base -labsl_spinlock_wait -labsl_time_zone -labsl_int128 -labsl_symbolize -labsl_stacktrace -labsl_malloc_internal -labsl_demangle_internal -labsl_debugging_internal -labsl_dynamic_annotations -pthread ... Why they didn't at least create a thin archive or linker script so that -labsl would work is an interesting question.

@mglisse
Copy link
Member

mglisse commented Jul 4, 2019

Actually, I do not get your point.

Just that abseil is a bit inconvenient to use, not as easy as using more Boost stuff for instance (even the parts that require linking with a library).

I don't know yet if you plan to test for the presence of abseil or to include a copy of it.

Do you have a suggestion about an alternative?

Not particularly. Mael posted a link to a comparison between many implementations, some of which are header-only. They don't all have Google maintaining them though, and you may wish to use more stuff from abseil later.

@lrineau lrineau changed the title Use Abseil hash maps instead of std unordered containers for a hugde speedup Use Abseil hash maps instead of std unordered containers for a huge speedup Jul 4, 2019
@lrineau
Copy link
Member Author

lrineau commented Jul 4, 2019

I would love to use the single-header implementation robin_hood.h, that is said to be faster by those benchmarks, but it is not compatible, for the moment, with our use of -frounding-math:

robin_hood.h:1901:59: error: ‘(8.0e+1 / 1.0e+2)’ is not a constant expression
 1901 |         static constexpr double factor = MaxLoadFactor100 / 100.0;
      |                                          ~~~~~~~~~~~~~~~~~^~~~~~~

@lrineau
Copy link
Member Author

lrineau commented Jul 4, 2019

@lrineau commented on Jul 4, 2019, 3:05 PM GMT+2:

I would love to use the single-header implementation robin_hood.h, that is said to be faster by those benchmarks, but it is not compatible, for the moment, with our use of -frounding-math:

robin_hood.h:1901:59: error: ‘(8.0e+1 / 1.0e+2)’ is not a constant expression
 1901 |         static constexpr double factor = MaxLoadFactor100 / 100.0;
      |                                          ~~~~~~~~~~~~~~~~~^~~~~~~

I have artificially removed -frounding-math, to compile the test, and it is slower than using abseil: 3.8 seconds instead of 3.2 using abseil.

@mglisse
Copy link
Member

mglisse commented Jul 4, 2019

our use of -frounding-math

I don't remember if you tried recently to remove it, to see what breaks. But that's not the right place to discuss it.

I guess constexpr is a context where gcc should ignore rounding modes...

it is slower than using abseil

The list also includes phmap, which is described as essentially a headerified version of abseil.

@MaelRL MaelRL added this to the 5.3-beta milestone Dec 2, 2020
@GilesBathgate
Copy link
Contributor

Hi @lrineau, I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation. Probably I did something wrong or I am not understanding the use case of Unique_hash_map correctly:

https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

Would this benefit from the abseil version of map?

@mglisse
Copy link
Member

mglisse commented Feb 8, 2021

I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation.

The whole point of this issue is that std::unordered_map is slow, what did you expect?

@GilesBathgate
Copy link
Contributor

GilesBathgate commented Feb 8, 2021

@mglisse 😂 Yes, But my reason for investigating it (prior to coming across this PR) was that in some cases, I've found anything other than chained_map to be faster, especially when the number of elements is small because chained_map doesn't lazily initialise minimum of 512 buckets.

For example here:

CGAL::Unique_hash_map<SVertex_const_iterator,bool> visited(false);

where visited only contains approximately 6 to 10 elements. The chained_map construction dominates the CPU usage, making O(1) lookup irrelevant probably a std::vector with O(n) would be faster in this scenario. Another example where Unique_hash_map construction is slow is #5143

@lrineau
Copy link
Member Author

lrineau commented Feb 8, 2021

Hi @lrineau, I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation. Probably I did something wrong or I am not understanding the use case of Unique_hash_map correctly:

gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23_ (too large to embed)_

Would this benefit from the abseil version of map?

Hi Giles, I do not have the time to investigate on your issue. I hope others will, though.

The best way to know if the Abseil hash maps can benefit your code is to try, anyway.

@GilesBathgate
Copy link
Contributor

@lrineau Thanks I've given it a try

https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

@mglisse Abseil was 20% slower than std:unordered_map, for me though so I'm confused now.

@GilesBathgate
Copy link
Contributor

@lrineau @mglisse I am getting good results with robin_hood.h I use robin_hood::unordered_node_map (The issue with -frounding math was resolved martinus/robin-hood-hashing#51)

Updated gist:
https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

@mglisse
Copy link
Member

mglisse commented Feb 11, 2021

where visited only contains approximately 6 to 10 elements

Did you try a flat_map for this one? (I know nothing about that code, and in particular not if the size of this container can ever be large) Such small maps are not a typical use case for hash maps, so many are probably not optimized for that case. The code you point to seems suboptimal, it calls visited[v] twice in a row, changing the code a bit, a (hash_)set would make more sense than a map (it probably used a map because there was no set in cgal), checking visited.insert(v).second.

Nice if robin hood works well, although if you want to change Unique_hash_map, which is used a lot in CGAL, and not just use a different map in one specific place, reviewers are likely to ask for various benchmarks, to make sure it doesn't regress some other use cases (bigger table, more expensive key, etc). But it seems quite possible that it is indeed faster all around.

@GilesBathgate
Copy link
Contributor

@mglisse Yes, I tried a boost::flat_set and it was much faster I'll add a comment about that to #5379

Regarding Unique_hash_map if I understand correctly it's optimised for when the keys are known to be unique and therefore no collisions. That said robin_hood does seem to out-perform chained_map in my tests, especially in the object construction. Either way, I've found some that some cases require robin_hood::unordered_node_map instead of robin_hood::unordered_flat_map So it doesn't get the cache friendliness benefits. I was thinking of asking @martinus if there is a way to use robin_hood::* in a way that is optimal for unique hash maps.

@martinus
Copy link

If you know the maximum size of the hashmap, it is beneficial to call reserve() with that amount. That way you only get a single allocation. Not much else you can do as far as I see on a quick glance.

Also, as @mglisse said, the linked code looks suboptimal. E.g. unordered_flat_set might help, and I think the std::list could use a custom allocator so it doesn't malloc & free each time you put an element into it. A while ago I've created an allocator for that: https://gist.github.com/martinus/ee4cc9d82d4610647180301c47b030a2 see e.g. this benchmark: https://quick-bench.com/q/BLW3OTlodJ2c3TmQGJUPT_TAdGA

@mglisse
Copy link
Member

mglisse commented Feb 11, 2021

the std::list could

... be wrapped in a std::queue, which could then use its default backend deque 😉
(I didn't profile that code, no idea what part is relevant to optimize)

@GilesBathgate
Copy link
Contributor

If you know the maximum size of the hashmap, it is beneficial to call reserve() with that amount.

So this could be done:

-  CGAL::Unique_hash_map<SVertex_const_iterator,bool> visited(false);
+  robin_hood::unordered_flat_set<SVertex_const_iterator,Handle_hash_function> visited;
+  visited.reserve(number_of_svertices());

but unfortunately:

Size_type number_of_svertices() const
/*{\Mop returns the number of vertices.}*/
{ Size_type n(0);
SVertex_const_iterator vit;
CGAL_forall_svertices(vit, *this) ++n;
return n; }

😞

@GilesBathgate
Copy link
Contributor

GilesBathgate commented Feb 11, 2021

@mglisse Presumably one can use return CGAL::iterator_distance(svertices_begin(), svertices_end()); (or just std::distance) in number_of_svertices?

@mglisse
Copy link
Member

mglisse commented Feb 11, 2021

@mglisse Presumably one can use return CGAL::iterator_distance(svertices_begin(), svertices_end()); (or just std::distance) in number_of_svertices?

Probably, but why? It would perform exactly the same iteration.

@GilesBathgate
Copy link
Contributor

Probably, but why? It would perform exactly the same iteration.

Because I thought it was a random access iterator, so last-first, it's not of course, so you are right.

@GilesBathgate
Copy link
Contributor

GilesBathgate commented Feb 11, 2021

I know nothing about that code, and in particular not if the size of this container can ever be large

@mglisse Yes it can be large if you have a cone with 10,000 sides, the sphere_map of the apex will have 10,000 svertices. so visited would need that many elements. robin_hood::unordered_flat_map with .reserve does seem to be a good option though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants