
Conversation

illwieckz
Member

Chunked version of:

For unknown reasons, only one thread is spawned.

@illwieckz illwieckz marked this pull request as draft October 1, 2025 00:38
@illwieckz illwieckz force-pushed the illwieckz/poc-multithread-cpu-model-gnu-chunk branch 3 times, most recently from d250969 to 8c6aa41 Compare October 1, 2025 00:42
@illwieckz
Member Author

@slipher here is my chunked variant, but it doesn't spawn more than one thread.

Actually, it might be possible to write a custom dispatch function, but I don't know how to do it.

@illwieckz
Member Author

I added a commit that uses standard threading functions instead of __gnu_parallel::for_each() or std::for_each( std::execution::par, …). That would allow us to not require libgomp or libtbb, but it still runs as if everything was done sequentially.
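For readers following along, here is a minimal sketch of what chunked dispatch with plain std::thread could look like. It is not the code from the commit: RunChunked and its parameters are hypothetical names, and it assumes the work can be split into independent contiguous index ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: calls work(begin, end) on contiguous index ranges,
// at most one range per requested thread, and waits for all of them.
void RunChunked(std::size_t itemCount, std::size_t threadCount,
                const std::function<void(std::size_t, std::size_t)>& work)
{
    assert(threadCount > 0);

    // Round up so that every item falls into some chunk.
    const std::size_t chunkSize = (itemCount + threadCount - 1) / threadCount;

    std::vector<std::thread> threads;

    for (std::size_t begin = 0; begin < itemCount; begin += chunkSize)
    {
        const std::size_t end = std::min(begin + chunkSize, itemCount);
        threads.emplace_back(work, begin, end);
    }

    // Threads are created and destroyed on every call: nothing is reused.
    for (std::thread& thread : threads)
    {
        thread.join();
    }
}
```

A caller would pass a lambda that processes items in the range [begin, end). Since a fresh set of threads is started on each call, the per-call overhead can be significant when this runs many times per frame.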

I added a logger to check what happens, and everything looks correct. I don't know what is missing.

@illwieckz
Member Author

Hmm, one drawback of doing it that way is that, unlike with OpenMP, threads are not reused. Profilers like Orbit then list thousands of threads: not only is that amount crazy to browse, but the profiling itself is of little use because statistics are computed for each thread separately. It would be cool to be able to reuse those threads.
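One common way to reuse threads is a small persistent worker pool. The sketch below is only an illustration of that idea, not something from this branch: WorkerPool, Submit and Wait are hypothetical names, and it assumes tasks are independent and do not throw.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal pool: workers are started once and reused for every task.
class WorkerPool
{
public:
    explicit WorkerPool(std::size_t threadCount)
    {
        assert(threadCount > 0);

        for (std::size_t i = 0; i < threadCount; i++)
        {
            workers.emplace_back([this] { Loop(); });
        }
    }

    ~WorkerPool()
    {
        {
            std::lock_guard<std::mutex> lock(tasksMutex);
            stopping = true;
        }

        wake.notify_all();

        for (std::thread& worker : workers)
        {
            worker.join();
        }
    }

    void Submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(tasksMutex);
            tasks.push(std::move(task));
            pending++;
        }

        wake.notify_one();
    }

    // Blocks until every submitted task has finished.
    void Wait()
    {
        std::unique_lock<std::mutex> lock(tasksMutex);
        idle.wait(lock, [this] { return pending == 0; });
    }

private:
    void Loop()
    {
        for (;;)
        {
            std::function<void()> task;

            {
                std::unique_lock<std::mutex> lock(tasksMutex);
                wake.wait(lock, [this] { return stopping || !tasks.empty(); });

                // The queue is only empty here when the pool is stopping.
                if (tasks.empty()) return;

                task = std::move(tasks.front());
                tasks.pop();
            }

            task();

            {
                std::lock_guard<std::mutex> lock(tasksMutex);
                pending--;
                if (pending == 0) idle.notify_all();
            }
        }
    }

    std::mutex tasksMutex;
    std::condition_variable wake;
    std::condition_variable idle;
    std::queue<std::function<void()>> tasks;
    std::size_t pending = 0;
    bool stopping = false;
    std::vector<std::thread> workers;
};
```

With something like this, each frame would submit one task per chunk and then call Wait(), so a profiler would see the same few worker threads for the whole run instead of thousands of short-lived ones.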

@illwieckz
Member Author

But at least, if we could get the current implementation working that would be a start.

@illwieckz illwieckz force-pushed the illwieckz/poc-multithread-cpu-model-gnu-chunk branch from 50dccf8 to f976903 Compare October 1, 2025 02:53
@illwieckz
Member Author

Hmm, actually, it seems to work with my custom thread start. It's just so inefficient that performance drops as if nothing was done. When switching from 1 thread to 2, I do see a performance difference. It's just so bad compared to OpenMP.

But now I don't get why this chunked implementation doesn't work with OpenMP.

@illwieckz
Member Author

I got it working with OpenMP; I had to use another syntax (which in fact skips my useless vector trick). I now get 438 fps with 16 threads, which is much faster! So yes, my idea of controlling the way the work is chunked was good!
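The exact syntax used in the branch is not shown in this thread. As a rough sketch of the general idea, assuming the standard parallel for construct looping over chunk indices, it could look like the following. ScaleChunked is a hypothetical stand-in for the real per-vertex work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-vertex work: scales every value in place.
void ScaleChunked(std::vector<float>& values, float factor, int chunkCount)
{
    assert(chunkCount > 0);

    const std::size_t count = values.size();
    const std::size_t chunks = static_cast<std::size_t>(chunkCount);

    // Round up so that every item falls into some chunk.
    const std::size_t chunkSize = (count + chunks - 1) / chunks;

    // Each loop iteration is one whole chunk, so OpenMP hands out chunks
    // rather than single items. Without -fopenmp the pragma is ignored
    // and the loop simply runs sequentially.
    #pragma omp parallel for schedule(static)
    for (int chunk = 0; chunk < chunkCount; chunk++)
    {
        const std::size_t begin =
            std::min(static_cast<std::size_t>(chunk) * chunkSize, count);
        const std::size_t end = std::min(begin + chunkSize, count);

        for (std::size_t i = begin; i < end; i++)
        {
            values[i] *= factor;
        }
    }
}
```

Looping over chunk indices directly means no intermediate container of chunks is needed, and the chunk count stays under the caller's control.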

@illwieckz
Member Author

With this implementation, the CPU code is now as fast as the GPU code running on the CPU with the llvmpipe software renderer. With LIBGL_ALWAYS_SOFTWARE=1 I get the exact same framerate whether I enable r_vboVertexSkinning or disable it. Before, the difference wasn't big; now I see none.

@illwieckz
Member Author

The experiment was a success. I'm closing this and will submit a completed and cleaned-up branch later.

@illwieckz illwieckz closed this Oct 1, 2025
@illwieckz illwieckz deleted the illwieckz/poc-multithread-cpu-model-gnu-chunk branch October 1, 2025 19:33
