
Unexpected Performance: Single-Threaded Faster than Multi-Threaded in Point Cloud Alignment #145

Closed
Ea510chan opened this issue Jan 13, 2024 · 2 comments

@Ea510chan

Hello there! Thanks for your great work!
I ran into an issue when I deployed it on my PC. Could anyone help me take a look? Thanks!

Description

I have observed unexpected performance behavior while using fast_gicp: the single-threaded versions of certain point cloud alignment algorithms, such as GICP and NDT, are outperforming their multi-threaded counterparts. This was observed while aligning two point clouds of 17047 and 17334 points.
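For context, below is a minimal sketch of how I understand the multi-threaded GICP is driven through fast_gicp's PCL-style registration API. This is not the gicp_align benchmark source; the header path and the setNumThreads() call reflect my reading of the upstream repository and may differ in detail.

#include <iostream>
#include <omp.h>
#include <pcl/point_types.h>
#include <pcl/io/pcd_io.h>
#include <fast_gicp/gicp/fast_gicp.hpp>

int main(int argc, char** argv) {
  // Load the target and source clouds given on the command line.
  pcl::PointCloud<pcl::PointXYZ>::Ptr target(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::PointCloud<pcl::PointXYZ>::Ptr source(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::io::loadPCDFile(argv[1], *target);
  pcl::io::loadPCDFile(argv[2], *source);

  // Multi-threaded GICP; using every available hardware thread (32 on an i9-13900KF).
  fast_gicp::FastGICP<pcl::PointXYZ, pcl::PointXYZ> gicp;
  gicp.setNumThreads(omp_get_max_threads());
  gicp.setInputTarget(target);
  gicp.setInputSource(source);

  pcl::PointCloud<pcl::PointXYZ> aligned;
  gicp.align(aligned);
  std::cout << "fitness_score: " << gicp.getFitnessScore() << std::endl;
  return 0;
}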

Environment

OS: Ubuntu 20.04 + ROS Noetic
GPU: RTX 4090
CPU: i9-13900KF
RAM: 32GB
The repo is deployed in Docker on WSL.

Details

The execution times for the various algorithms were recorded, and the single-threaded implementations were consistently faster than the multi-threaded ones. Below are some of the results obtained:

$ rosrun fast_gicp gicp_align 251370668.pcd 251371071.pcd
target:17047[pts] source:17334[pts]
--- pcl_gicp ---
single:110.186[msec] 100times:11059.9[msec] fitness_score:0.204892
--- pcl_ndt ---
single:39.1375[msec] 100times:4043.5[msec] fitness_score:0.229616
--- fgicp_st ---
single:101.371[msec] 100times:9945.61[msec] 100times_reuse:6586.6[msec] fitness_score:0.204376
--- fgicp_mt ---
single:135.229[msec] 100times:12986.9[msec] 100times_reuse:11950.3[msec] fitness_score:0.204384
--- vgicp_st ---
single:85.6506[msec] 100times:7514.18[msec] 100times_reuse:4194.52[msec] fitness_score:0.205022
--- vgicp_mt ---
single:158.688[msec] 100times:16300.5[msec] 100times_reuse:15309.5[msec] fitness_score:0.205022
--- ndt_cuda (P2D) ---
single:17.4151[msec] 100times:1702.9[msec] 100times_reuse:1340.19[msec] fitness_score:0.197208
--- ndt_cuda (D2D) ---
single:13.5261[msec] 100times:1391.88[msec] 100times_reuse:1119.26[msec] fitness_score:0.199985
--- vgicp_cuda (parallel_kdtree) ---
single:37.8372[msec] 100times:3054.31[msec] 100times_reuse:1987.94[msec] fitness_score:0.205017
--- vgicp_cuda (gpu_bruteforce) ---
single:65.4749[msec] 100times:3064.62[msec] 100times_reuse:2966.4[msec] fitness_score:0.249594
--- vgicp_cuda (gpu_rbf_kernel) ---
single:13.1453[msec] 100times:1515.33[msec] 100times_reuse:1119.99[msec] fitness_score:0.204766

Expected Behavior:

Typically, one would expect the multi-threaded implementations to be faster or at least as fast as the single-threaded ones, especially when dealing with large datasets.

@Ea510chan
Author

Hi everyone,

I wanted to share an update on the performance issue I was experiencing with the multi-threaded versions of point cloud alignment algorithms.

Initially, I was using the maximum thread count supported by my CPU (32 threads), but this actually resulted in slower performance than the single-threaded implementations.

However, when I reduced the number of threads to 8, the processing times for the multi-threaded versions improved dramatically and became what one would expect: faster than the single-threaded versions. Here are the updated results:

$ rosrun fast_gicp gicp_align 251370668.pcd 251371071.pcd
target:17047[pts] source:17334[pts]
--- pcl_gicp ---
single:114.265[msec] 100times:11190.8[msec] fitness_score:0.204892
--- pcl_ndt ---
single:40.3903[msec] 100times:4108.75[msec] fitness_score:0.229616
--- fgicp_st ---
single:103.508[msec] 100times:10122.8[msec] 100times_reuse:6677.71[msec] fitness_score:0.204376
--- fgicp_mt ---
single:22.2643[msec] 100times:2076.86[msec] 100times_reuse:1322.39[msec] fitness_score:0.204384
--- vgicp_st ---
single:76.7637[msec] 100times:7601.88[msec] 100times_reuse:4227.26[msec] fitness_score:0.205022
--- vgicp_mt ---
single:16.8928[msec] 100times:1723.56[msec] 100times_reuse:964.225[msec] fitness_score:0.205022
--- ndt_cuda (P2D) ---
single:17.818[msec] 100times:1747.58[msec] 100times_reuse:1329.59[msec] fitness_score:0.197216
--- ndt_cuda (D2D) ---
single:13.9255[msec] 100times:1415.41[msec] 100times_reuse:1161.17[msec] fitness_score:0.199983
--- vgicp_cuda (parallel_kdtree) ---
single:36.8168[msec] 100times:2271.8[msec] 100times_reuse:1713.19[msec] fitness_score:0.205017
--- vgicp_cuda (gpu_bruteforce) ---
single:55.5222[msec] 100times:2822.75[msec] 100times_reuse:2615.85[msec] fitness_score:0.249594
--- vgicp_cuda (gpu_rbf_kernel) ---
single:14.8914[msec] 100times:1403.59[msec] 100times_reuse:941.221[msec] fitness_score:0.204766

It appears that using the maximum thread count was creating a bottleneck, possibly due to context-switching overhead or resource contention. The i9-13900KF has 8 performance cores and 16 efficiency cores, so spreading the OpenMP workers across all 32 hardware threads likely leaves the fast cores waiting on the slow ones; a reduced thread count that better matches the CPU's capabilities and the workload size seems to be the key to good performance.
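For anyone hitting the same thing: the cap can be applied either through the standard OMP_NUM_THREADS environment variable when launching gicp_align, or explicitly in code. Below is a minimal sketch of the in-code variant, assuming fast_gicp exposes setNumThreads() on the multi-threaded classes as in the upstream repository; the helper function name is just for illustration.

#include <pcl/point_types.h>
#include <fast_gicp/gicp/fast_gicp.hpp>
#include <fast_gicp/gicp/fast_vgicp.hpp>

// Cap the OpenMP worker count instead of using all 32 hardware threads.
void configure_registration_threads() {
  fast_gicp::FastGICP<pcl::PointXYZ, pcl::PointXYZ> fgicp_mt;
  fast_gicp::FastVGICP<pcl::PointXYZ, pcl::PointXYZ> vgicp_mt;
  fgicp_mt.setNumThreads(8);  // 8 threads beat 32 on this i9-13900KF under WSL
  vgicp_mt.setNumThreads(8);
}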

@koide3
Member

koide3 commented Jan 15, 2024

Thanks for the helpful information. I will mention this in the README.
