[registration] Add OMP based Multi-threading to IterativeClosestPoint #4520
mvieth merged 12 commits into PointCloudLibrary:master
Conversation
Once @shrijitsingh99 is able to integrate the executor and test, we will be one step closer. He mentioned he'd have free cycles near the winter holidays.
JStech left a comment:
Looks like a big improvement. I just saw a couple of places where unsigned integers could be used.
```cpp
/** \brief Set the number of threads to use.
 * \param nr_threads the number of hardware threads to use (0 sets the value back to automatic)
 */
void setNumberOfThreads(int nr_threads) {
```
`nr_threads` should be unsigned.
```cpp
 * will never be recomputed */
bool force_no_recompute_reciprocal_;

int num_threads_;
```
`num_threads_` should also be unsigned.
```cpp
if (num_threads_ != 1) {
  // Make correspondences ordered
  std::sort(correspondences.begin(), correspondences.end(), [](const auto& lhs, const auto& rhs) {
    return lhs.index_query < rhs.index_query;
  });
```
Since you're sorting here, shall we have a few alternate strategies to benchmark against?
- Separate correspondences per thread and merge-sort them in the end
- Critical section for binary search and inserting, so there's no need for a search in the end
I think the 2nd option might be the slowest, but that's just a hunch
Good suggestion. I compared the following strategies:
1. Use atomic and sort (the first commit)
2. Separate correspondences per thread and merge them after the loop (the last commit)
3. Do binary search and insert in a critical section
4. Use an OMP user-defined reduction

Here is a short summary of the benchmark results; I think the 2nd strategy is the best one:
1. Sort gets slower as the number of threads increases
2. Almost constant merge time and small overhead
3. It was the slowest, as expected
4. No post-process time, but slightly larger overhead
Regarding the inplace_merge part: although it might be better to merge results in a divide-and-conquer way, that would make the code much more complex. Since we can expect num_threads to be small, I consider the current form (merging results sequentially) sufficient.
Benchmark results:

```
# 69088pts vs 69792pts
# CorrespondenceEstimation::determineCorrespondences()

# without omp (baseline)
num_threads: total time
1:169.906[msec]
2:177.868[msec]
3:172.066[msec]
4:169.009[msec]
5:169.16[msec]

# 1. atomic and sort
num_threads: for loop, post process(sort), total time
1:169.316[msec] 5e-05[msec] 169.327[msec]
2:98.9463[msec] 1.64847[msec] 100.606[msec]
3:108.082[msec] 2.1331[msec] 110.225[msec]
4:66.546[msec] 2.39187[msec] 68.9498[msec]
5:62.1945[msec] 2.65166[msec] 64.8581[msec]
6:52.2621[msec] 3.33361[msec] 55.6112[msec]
7:60.9466[msec] 2.83809[msec] 63.7972[msec]
8:45.5633[msec] 2.79004[msec] 48.3653[msec]
9:48.2131[msec] 2.88753[msec] 51.1126[msec]

# 2. per-thread correspondences and inplace_merge
num_threads: for loop, post process(merge), total time
1:167.489[msec] 4.8e-05[msec] 167.502[msec]
2:93.6329[msec] 0.490607[msec] 94.138[msec]
3:100.873[msec] 0.604739[msec] 101.493[msec]
4:65.5067[msec] 0.402887[msec] 65.9252[msec]
5:61.0197[msec] 0.674917[msec] 61.7172[msec]
6:60.6732[msec] 0.529535[msec] 61.2234[msec]
7:50.8765[msec] 0.557357[msec] 51.449[msec]
8:61.3818[msec] 0.605387[msec] 62.0017[msec]
9:48.0522[msec] 0.634407[msec] 48.7004[msec]

# 3. insert in critical section
num_threads: total time
1:3174.6[msec]
2:4121.34[msec]
3:4462.44[msec]
4:4753.24[msec]
5:4854.75[msec]

# 4. OMP user-defined reduction
num_threads: for loop, post process, total time
1:174.546[msec] 4.6e-05[msec] 174.638[msec]
2:94.445[msec] 5.2e-05[msec] 94.4573[msec]
3:102.331[msec] 7.6e-05[msec] 102.343[msec]
4:68.7443[msec] 0.000117[msec] 68.8659[msec]
5:68.1277[msec] 4e-05[msec] 68.1397[msec]
6:53.522[msec] 4.4e-05[msec] 53.5339[msec]
7:62.1336[msec] 0.000172[msec] 62.1449[msec]
8:57.7286[msec] 0.000171[msec] 57.7404[msec]
9:58.9015[msec] 0.000131[msec] 58.915[msec]
```
Code of the 4th strategy (OMP user-defined reduction) was something like the following (`index` and `distance` are the per-thread `nearestKSearch` result buffers):

```cpp
#pragma omp declare reduction( \
    merge : pcl::Correspondences : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

pcl::Indices index(1);           // nearest-neighbor index buffer
std::vector<float> distance(1);  // squared-distance buffer
pcl::Correspondences corrs;
#pragma omp parallel for \
    default(none) \
    shared(tree_, indices_, max_dist_sqr) \
    firstprivate(index, distance) \
    reduction(merge : corrs) \
    num_threads(num_threads_)
for (int i = 0; i < static_cast<int>(indices_->size()); i++) {
  const auto& idx = (*indices_)[i];
  tree_->nearestKSearch((*input_)[idx], 1, index, distance);
  if (distance[0] > max_dist_sqr)
    continue;
  pcl::Correspondence corr;
  corr.index_query = idx;
  corr.index_match = index[0];
  corr.distance = distance[0];
  corrs.push_back(corr);
}
correspondences.swap(corrs);
```
Nice experiment @koide3 ! ❤️
Thanks a lot for the follow-through. Leaving this unresolved so others can see it too 😄
Build errors in tutorials: Is it a missing include or?
I wrapped |
mvieth left a comment:
Sorry for taking so long with the review.
Is the inplace_merge at the end even necessary? Wouldn't the correspondences be automatically sorted after the move? At least for static scheduling? I probably overlooked something there 😄
This seems useful. Any chance to get this moved forward?
I will take another look
I am happy with the changes now. I will merge this after PCL 1.14.1 is released because adding the |
This PR adds multi-threading capability to `pcl::IterativeClosestPoint` and `pcl::CorrespondenceEstimation`. `pcl::ICP` spends about 95% of its time on correspondence estimation, so it can be sped up considerably by making `CorrespondenceEstimation` multi-threaded. Some classes that inherit from `pcl::ICP` can also be sped up by this PR (e.g., `pcl::IterativeClosestPointNonLinear`).

Note: The sort at the end of `CorrespondenceEstimation::determineCorrespondences()` is necessary to pass some tests because they assume correspondences are ordered. I also tried `omp ordered` to keep correspondences ordered, but it was very slow.

Off-topic: Should we make a common interface for `setNumberOfThreads()`?

Some benchmark results: