
[registration] Add OMP based Multi-threading to IterativeClosestPoint#4520

Merged
mvieth merged 12 commits into PointCloudLibrary:master from koide3:multi-threaded-icp
May 5, 2024

Conversation

Contributor

@koide3 koide3 commented Nov 18, 2020

This PR adds multi-threading capability to pcl::IterativeClosestPoint and pcl::CorrespondenceEstimation. pcl::ICP spends about 95% of its time on correspondence estimation, so it can be sped up substantially by making CorrespondenceEstimation multi-threaded. Some classes that inherit from pcl::ICP also benefit from this PR (e.g., pcl::IterativeClosestPointNonLinear).

Note: The sort at the end of CorrespondenceEstimation::determineCorrespondences() is necessary to pass some tests, because they assume the correspondences are ordered. I also tried omp ordered to keep the correspondences ordered, but it was very slow.

Off-topic: Should we make a common interface for setNumberOfThreads()?
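The core idea of the PR, splitting the nearest-neighbor queries across threads, collecting results into per-thread buckets, and then restoring query order, can be sketched in self-contained C++. This is a toy 1-D brute-force search using std::thread rather than OpenMP and PCL's kd-tree; the names `Corr` and `estimate` are hypothetical and not part of the PR or PCL's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Toy stand-in for a correspondence: query index, match index, distance.
struct Corr { int query; int match; float dist; };

// Brute-force nearest-neighbor correspondences, split across threads.
// Each thread writes to its own bucket, so no locking is needed; buckets are
// concatenated and sorted afterwards to restore query order, mirroring the
// "sort at the end" that the PR description mentions.
std::vector<Corr> estimate(const std::vector<float>& src,
                           const std::vector<float>& tgt,
                           unsigned num_threads) {
  std::vector<std::vector<Corr>> buckets(num_threads);
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < num_threads; ++t) {
    pool.emplace_back([&, t] {
      // Strided assignment of query points to threads.
      for (std::size_t i = t; i < src.size(); i += num_threads) {
        int best = 0;
        float best_d = std::abs(src[i] - tgt[0]);
        for (std::size_t j = 1; j < tgt.size(); ++j) {
          const float d = std::abs(src[i] - tgt[j]);
          if (d < best_d) { best_d = d; best = static_cast<int>(j); }
        }
        buckets[t].push_back({static_cast<int>(i), best, best_d});
      }
    });
  }
  for (auto& th : pool) th.join();

  std::vector<Corr> out;
  for (auto& b : buckets) out.insert(out.end(), b.begin(), b.end());
  // Restore query order, since the tests expect ordered correspondences.
  std::sort(out.begin(), out.end(),
            [](const Corr& a, const Corr& b) { return a.query < b.query; });
  return out;
}
```

The per-thread buckets avoid contention on a shared output vector; the final sort is the price paid for that, which is what the benchmarks below measure.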

Some benchmark results:

# aligning 16k points

* with omp
threads: time
1:153.565[msec]
2:87.3185[msec]
3:67.8443[msec]
4:54.0599[msec]
5:45.6129[msec]
6:40.9252[msec]
7:43.6004[msec]
8:40.2708[msec]
9:46.0424[msec]

* without omp
1:149.439[msec]
2:150.391[msec]
3:152.18[msec]
4:153.475[msec]
5:149.943[msec]
6:149.483[msec]
7:148.78[msec]
8:149.607[msec]
9:149.231[msec]

* without sort
1:149.21[msec]
2:84.0002[msec]
3:60.262[msec]
4:47.0388[msec]
5:39.0977[msec]
6:36.8624[msec]
7:36.5644[msec]
8:32.6782[msec]
9:39.3133[msec]

* use omp ordered instead of sort
1:209.918[msec]
2:196.892[msec]
3:197.966[msec]
4:196.945[msec]
5:197.686[msec]
6:200.591[msec]
7:200.022[msec]
8:208.916[msec]
9:223.545[msec]


* with omp and use_reciprocal_correspondence_ == true
1:331.359[msec]
2:180.315[msec]
3:130.543[msec]
4:106.585[msec]
5:91.2735[msec]
6:88.4035[msec]
7:94.627[msec]
8:99.4426[msec]
9:97.6912[msec]

* without omp and use_reciprocal_correspondence_ == true
1:329.28[msec]
2:324.217[msec]
3:322.527[msec]
4:322.287[msec]
5:322.802[msec]
6:326.154[msec]
7:329.858[msec]
8:333.565[msec]
9:336.588[msec]

@koide3 koide3 requested review from kunaltyagi and mvieth November 18, 2020 10:55
@koide3 koide3 added the "module: registration" and "needs: code review" labels Nov 18, 2020
@kunaltyagi
Member

Off-topic: Should we make a common interface for setNumberOfThreads()?

When @shrijitsingh99 is able to integrate the executor and test it, we will be one step closer. He mentioned he'd have free cycles around the winter holidays.
But yes, in principle I concur with you on this.

Contributor

@JStech JStech left a comment


Looks like a big improvement. I just saw a couple of places where unsigned integers could be used.

/** \brief Set the number of threads to use.
* \param nr_threads the number of hardware threads to use (0 sets the value back to automatic)
*/
void setNumberOfThreads(int nr_threads) {
Contributor

nr_threads should be unsigned

Contributor Author

Done

* will never be recomputed*/
bool force_no_recompute_reciprocal_;

int num_threads_;
Contributor

num_threads_ should also be unsigned

Contributor Author

Done

Comment thread registration/include/pcl/registration/impl/correspondence_estimation.hpp Outdated

if(num_threads_ != 1) {
// Make correspondences ordered
std::sort(correspondences.begin(), correspondences.end(), [](const auto& lhs, const auto& rhs) { return lhs.index_query < rhs.index_query; });
Member

Since you're sorting here, shall we have a few alternate strategies to benchmark against?

  1. Separate correspondences per thread and merge-sort them in the end
  2. Critical section for binary search and inserting, so there's no need for a sort at the end

I think the 2nd option might be the slowest, but that's just a hunch

Contributor Author

Good suggestion. I compared the following strategies:

  1. Use atomic and sort (the first commit)
  2. Separate correspondences per-thread and merge them after the loop (the last commit)
  3. Do binary search and insert in critical section
  4. Use OMP user-defined reduction

Here is a short summary of the benchmark results, and I think the 2nd strategy is the best one.

  1. Sort gets slower as the number of threads increases
  2. Almost constant merge time and small overhead
  3. It was the slowest as expected
  4. No post process time, but slightly larger overhead

Regarding the inplace_merge part: although it would be better to merge the results in a divide-and-conquer fashion, I think that would make the code much more complex. Since num_threads is expected to be small, I consider the current form (merging results sequentially) sufficient.
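The sequential merge of strategy 2 can be sketched in isolation. This is a toy version that operates on sorted int buckets instead of per-thread pcl::Correspondences; `merge_buckets` is a hypothetical name, not the PR's actual function:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequentially merge per-thread result vectors (each already sorted) into one
// sorted vector, the way strategy 2 merges per-thread correspondence lists.
std::vector<int> merge_buckets(std::vector<std::vector<int>> buckets) {
  std::vector<int> out;
  std::vector<std::size_t> starts;  // offset of each appended bucket in `out`
  for (auto& b : buckets) {
    starts.push_back(out.size());
    out.insert(out.end(), b.begin(), b.end());
  }
  // Fold the buckets together one at a time: after iteration k, the prefix
  // [0, end of bucket k) is sorted. Sequential (not divide-and-conquer)
  // merging is fine when the thread count is small.
  for (std::size_t k = 1; k < starts.size(); ++k) {
    const auto end = (k + 1 < starts.size()) ? starts[k + 1] : out.size();
    std::inplace_merge(out.begin(), out.begin() + starts[k],
                       out.begin() + end);
  }
  return out;
}
```

Each std::inplace_merge call combines two already-sorted adjacent ranges in linear time, which matches the near-constant, sub-millisecond "post process (merge)" column in the benchmark below.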

Benchmark results:

# 69088pts vs 69792pts
# CorrespondenceEstimation::determineCorrespondences()

# without omp (baseline)
num_threads:total time
1:169.906[msec]
2:177.868[msec]
3:172.066[msec]
4:169.009[msec]
5:169.16[msec]

# 1. atomic and sort
num_threads: for loop, post process(sort), total time
1:169.316[msec] 5e-05[msec] 169.327[msec]
2:98.9463[msec] 1.64847[msec] 100.606[msec]
3:108.082[msec] 2.1331[msec] 110.225[msec]
4:66.546[msec] 2.39187[msec] 68.9498[msec]
5:62.1945[msec] 2.65166[msec] 64.8581[msec]
6:52.2621[msec] 3.33361[msec] 55.6112[msec]
7:60.9466[msec] 2.83809[msec] 63.7972[msec]
8:45.5633[msec] 2.79004[msec] 48.3653[msec]
9:48.2131[msec] 2.88753[msec] 51.1126[msec]

# 2. per-thread correspondences and inplace_merge
num_threads: for loop, post process(merge), total time
1:167.489[msec] 4.8e-05[msec] 167.502[msec]
2:93.6329[msec] 0.490607[msec] 94.138[msec]
3:100.873[msec] 0.604739[msec] 101.493[msec]
4:65.5067[msec] 0.402887[msec] 65.9252[msec]
5:61.0197[msec] 0.674917[msec] 61.7172[msec]
6:60.6732[msec] 0.529535[msec] 61.2234[msec]
7:50.8765[msec] 0.557357[msec] 51.449[msec]
8:61.3818[msec] 0.605387[msec] 62.0017[msec]
9:48.0522[msec] 0.634407[msec] 48.7004[msec]

# 3. insert in critical section
num_threads: total time
1:3174.6[msec]
2:4121.34[msec]
3:4462.44[msec]
4:4753.24[msec]
5:4854.75[msec]

# 4. OMP user-defined reduction
num_threads: for loop, post process, total time
1:174.546[msec] 4.6e-05[msec] 174.638[msec]
2:94.445[msec] 5.2e-05[msec] 94.4573[msec]
3:102.331[msec] 7.6e-05[msec] 102.343[msec]
4:68.7443[msec] 0.000117[msec] 68.8659[msec]
5:68.1277[msec] 4e-05[msec] 68.1397[msec]
6:53.522[msec] 4.4e-05[msec] 53.5339[msec]
7:62.1336[msec] 0.000172[msec] 62.1449[msec]
8:57.7286[msec] 0.000171[msec] 57.7404[msec]
9:58.9015[msec] 0.000131[msec] 58.915[msec]

The code for the 4th strategy (OMP user-defined reduction) was something like the following:

  // `index` and `distance` are single-element k-NN result buffers;
  // firstprivate gives each thread its own copy.
  std::vector<int> index(1);
  std::vector<float> distance(1);

  #pragma omp declare reduction \
    (merge : pcl::Correspondences : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
  pcl::Correspondences corrs;

  #pragma omp parallel for \
    default(none) \
    shared(tree_, input_, indices_, max_dist_sqr) \
    firstprivate(index, distance) \
    reduction(merge : corrs) \
    num_threads(num_threads_)
  for (int i = 0; i < static_cast<int>(indices_->size()); i++) {
    const auto& idx = (*indices_)[i];
    tree_->nearestKSearch((*input_)[idx], 1, index, distance);
    if (distance[0] > max_dist_sqr)
      continue;

    pcl::Correspondence corr;
    corr.index_query = idx;
    corr.index_match = index[0];
    corr.distance = distance[0];

    corrs.push_back(corr);
  }

  correspondences.swap(corrs);

Member

Nice experiment @koide3 ! ❤️
Thanks a lot for the follow-through. Leaving this un-resolved so others can see this too 😄

Comment thread registration/include/pcl/registration/impl/correspondence_estimation.hpp Outdated
Member

@kunaltyagi kunaltyagi left a comment

Failing tests

Comment thread registration/include/pcl/registration/impl/correspondence_estimation.hpp Outdated
kunaltyagi
kunaltyagi previously approved these changes Dec 14, 2020
@larshg
Contributor

larshg commented Dec 20, 2020

Build errors in tutorials:

 error: there are no arguments to 'omp_get_thread_num' that depend on a template parameter, so a declaration of 'omp_get_thread_num' must be available [-fpermissive]
       per_thread_correspondences[omp_get_thread_num()].emplace_back(std::move(corr));

Is it a missing include, or something else?

@koide3
Contributor Author

koide3 commented Dec 22, 2020

Build errors in tutorials:

I wrapped omp_get_thread_num in #ifdef _OPENMP so it doesn't cause errors in OpenMP-disabled builds.
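Such a guard can look roughly like this (a minimal sketch, not the PR's exact code; `current_thread` is a hypothetical wrapper name):

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Report the calling thread's index. When OpenMP is disabled, or when called
// outside any parallel region, the index is 0, so serial builds fall back to
// a single per-thread bucket.
int current_thread() {
#ifdef _OPENMP
  return omp_get_thread_num();
#else
  return 0;
#endif
}
```

With this guard, `per_thread_correspondences[current_thread()]` compiles both with and without OpenMP, which is what the tutorial build failure above required.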

Member

@mvieth mvieth left a comment

Sorry for taking so long with the review.
Is the inplace_merge at the end even necessary? Wouldn't the correspondences be automatically sorted after the move? At least for static scheduling? I probably overlooked something there 😄

Comment thread registration/include/pcl/registration/correspondence_estimation.h Outdated
Comment thread registration/include/pcl/registration/icp.h
@keineahnung2345
Contributor

This seems useful. Any chance to get this moved forward?

@mvieth
Member

mvieth commented Jan 5, 2024

This seems useful. Any chance to get this moved forward?

I will take another look

@mvieth mvieth added this to the pcl-1.15.0 milestone Jan 20, 2024
@mvieth
Member

mvieth commented Jan 20, 2024

I am happy with the changes now. I will merge this after PCL 1.14.1 is released, because adding the num_threads_ variable breaks the ABI, and we don't want an ABI break between 1.14.0 and 1.14.1.

@mvieth mvieth added the "changelog: ABI break" label Jan 20, 2024
@mvieth mvieth merged commit 91bc0fd into PointCloudLibrary:master May 5, 2024