[registration] Add OMP based Multi-threading to IterativeClosestPoint #4520
mvieth merged 12 commits into PointCloudLibrary:master
Conversation
Once @shrijitsingh99 is able to integrate the executor and test, we will be one step closer. He mentioned he'd have free cycles near the winter holidays.
JStech left a comment:
Looks like a big improvement. I just saw a couple of places where unsigned integers could be used.
```cpp
/** \brief Set the number of threads to use.
 * \param nr_threads the number of hardware threads to use (0 sets the value back to automatic)
 */
void setNumberOfThreads(int nr_threads) {
```
`nr_threads` should be unsigned.
```cpp
 * will never be recomputed */
bool force_no_recompute_reciprocal_;

int num_threads_;
```
`num_threads_` should also be unsigned.
```cpp
if (num_threads_ != 1) {
  // Make correspondences ordered
  std::sort(correspondences.begin(), correspondences.end(), [](const auto& lhs, const auto& rhs) {
    return lhs.index_query < rhs.index_query;
  });
```
Since you're sorting here, shall we have a few alternate strategies to benchmark against?
- Separate correspondences per thread and merge-sort them in the end
- Critical section for binary search and inserting, so there's no need for a search in the end
I think the 2nd option might be the slowest, but that's just a hunch
Good suggestion. I compared the following strategies:
1. Use atomic and sort (the first commit)
2. Separate correspondences per thread and merge them after the loop (the last commit)
3. Do binary search and insert in a critical section
4. Use an OMP user-defined reduction

Here is a short summary of the benchmark results; I think the 2nd strategy is the best one:
1. Sort gets slower as the number of threads increases
2. Almost constant merge time and small overhead
3. It was the slowest, as expected
4. No post-process time, but slightly larger overhead
Regarding the inplace_merge part: although it might be better to merge results in a divide-and-conquer way, that would make the code much more complex. Since we can expect num_threads to be small, I consider the current form (merging results sequentially) sufficient.
Benchmark results:

```
# 69088pts vs 69792pts
# CorrespondenceEstimation::determineCorrespondences()

# without omp (baseline)
num_threads: total time
1:169.906[msec]
2:177.868[msec]
3:172.066[msec]
4:169.009[msec]
5:169.16[msec]

# 1. atomic and sort
num_threads: for loop, post process(sort), total time
1:169.316[msec] 5e-05[msec] 169.327[msec]
2:98.9463[msec] 1.64847[msec] 100.606[msec]
3:108.082[msec] 2.1331[msec] 110.225[msec]
4:66.546[msec] 2.39187[msec] 68.9498[msec]
5:62.1945[msec] 2.65166[msec] 64.8581[msec]
6:52.2621[msec] 3.33361[msec] 55.6112[msec]
7:60.9466[msec] 2.83809[msec] 63.7972[msec]
8:45.5633[msec] 2.79004[msec] 48.3653[msec]
9:48.2131[msec] 2.88753[msec] 51.1126[msec]

# 2. per-thread correspondences and inplace_merge
num_threads: for loop, post process(merge), total time
1:167.489[msec] 4.8e-05[msec] 167.502[msec]
2:93.6329[msec] 0.490607[msec] 94.138[msec]
3:100.873[msec] 0.604739[msec] 101.493[msec]
4:65.5067[msec] 0.402887[msec] 65.9252[msec]
5:61.0197[msec] 0.674917[msec] 61.7172[msec]
6:60.6732[msec] 0.529535[msec] 61.2234[msec]
7:50.8765[msec] 0.557357[msec] 51.449[msec]
8:61.3818[msec] 0.605387[msec] 62.0017[msec]
9:48.0522[msec] 0.634407[msec] 48.7004[msec]

# 3. insert in critical section
num_threads: total time
1:3174.6[msec]
2:4121.34[msec]
3:4462.44[msec]
4:4753.24[msec]
5:4854.75[msec]

# 4. OMP user-defined reduction
num_threads: for loop, post process, total time
1:174.546[msec] 4.6e-05[msec] 174.638[msec]
2:94.445[msec] 5.2e-05[msec] 94.4573[msec]
3:102.331[msec] 7.6e-05[msec] 102.343[msec]
4:68.7443[msec] 0.000117[msec] 68.8659[msec]
5:68.1277[msec] 4e-05[msec] 68.1397[msec]
6:53.522[msec] 4.4e-05[msec] 53.5339[msec]
7:62.1336[msec] 0.000172[msec] 62.1449[msec]
8:57.7286[msec] 0.000171[msec] 57.7404[msec]
9:58.9015[msec] 0.000131[msec] 58.915[msec]
```
Code of the 4th strategy (OMP user-defined reduction) was something like the following (`index` and `distance` are the per-thread `nearestKSearch` result buffers):

```cpp
#pragma omp declare reduction( \
    merge : pcl::Correspondences : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

pcl::Indices index(1);           // nearest-neighbor index buffer
std::vector<float> distance(1);  // squared-distance buffer
pcl::Correspondences corrs;
#pragma omp parallel for \
    default(none) \
    shared(tree_, indices_, max_dist_sqr) \
    firstprivate(index, distance) \
    reduction(merge : corrs) \
    num_threads(num_threads_)
for (int i = 0; i < static_cast<int>(indices_->size()); i++) {
  const auto& idx = (*indices_)[i];
  tree_->nearestKSearch((*input_)[idx], 1, index, distance);
  if (distance[0] > max_dist_sqr)
    continue;
  pcl::Correspondence corr;
  corr.index_query = idx;
  corr.index_match = index[0];
  corr.distance = distance[0];
  corrs.push_back(corr);
}
correspondences.swap(corrs);
```
Nice experiment @koide3 ! ❤️
Thanks a lot for the follow-through. Leaving this unresolved so others can see it too 😄
Build errors in tutorials: Is it a missing include or?
I wrapped |
mvieth left a comment:
Sorry for taking so long with the review.
Is the inplace_merge at the end even necessary? Wouldn't the correspondences be automatically sorted after the move? At least for static scheduling? I probably overlooked something there 😄
This seems useful. Any chance to get this moved forward?
I will take another look
I am happy with the changes now. I will merge this after PCL 1.14.1 is released because adding the |
This PR adds multi-threading capability to `pcl::IterativeClosestPoint` and `pcl::CorrespondenceEstimation`. `pcl::ICP` spends about 95% of its time on correspondence estimation, so it can be sped up considerably by making `CorrespondenceEstimation` multi-threaded. Some classes that inherit from `pcl::ICP` can also be sped up by this PR (e.g., `pcl::IterativeClosestPointNonLinear`).

Note: The sort at the end of `CorrespondenceEstimation::determineCorrespondences()` is necessary to pass some tests because they assume correspondences are ordered. I also tried `omp ordered` to keep correspondences ordered, but it was very slow.

Off-topic: Should we make a common interface for `setNumberOfThreads()`?

Some benchmark results: