Proposed fix for #40 #41

Merged: 6 commits, Jul 3, 2023

jtd-formlabs
Contributor

This PR edits thread_safe_queue to have more functionality and serve as a priority queue for the threads.

At the end of each thread execution, the threads bump themselves up to the top of the priority list for work assignment.
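
To make the idea concrete, here is a minimal sketch of such a priority list of worker ids. It is not the code from this PR: the class name `worker_priority_list` and its member functions are hypothetical, and it assumes that "the end of each thread execution" means the moment a worker finishes a task.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical sketch, not the library's actual thread_safe_queue.
// It keeps worker ids ordered by how recently each one became free.
class worker_priority_list {
  public:
    // Register a worker id at the back of the list.
    void push_back(std::size_t id) {
        std::scoped_lock lock(mutex_);
        ids_.push_back(id);
    }

    // Called by a worker once it has finished a task (assumption):
    // move its id to the front so it is the next one to receive work.
    // An id that is not listed yet is simply added at the front.
    void rotate_to_front(std::size_t id) {
        std::scoped_lock lock(mutex_);
        const auto it = std::find(ids_.begin(), ids_.end(), id);
        if (it != ids_.end()) {
            ids_.erase(it);
        }
        ids_.push_front(id);
    }

    // Called when a task is submitted: return the id at the front and
    // move it to the back, since that worker is about to become busy.
    std::optional<std::size_t> take_front_and_rotate_to_back() {
        std::scoped_lock lock(mutex_);
        if (ids_.empty()) {
            return std::nullopt;
        }
        const std::size_t id = ids_.front();
        ids_.pop_front();
        ids_.push_back(id);
        return id;
    }

  private:
    std::deque<std::size_t> ids_;
    std::mutex mutex_;
};
```

With this shape, a submitter would call `take_front_and_rotate_to_back()` to pick a target worker, and each worker would call `rotate_to_front(its_id)` after completing a task. A single mutex keeps the sketch simple; the real implementation may synchronize differently.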

@jtd-formlabs
Contributor Author

I have addressed the issues found by the workflows, and added a new unit test that I confirmed (on my machine) fails on master but passes on this branch.

DeveloperPaul123 self-requested a review on June 14, 2023 19:53
@DeveloperPaul123
Owner

Thank you for this! Once I have some time I can look into this more thoroughly and run it on my machine. I don't see any issues with merging it once I've done that. Thanks again!

@DeveloperPaul123
Owner

Overall this looks really good. It also seems to have a significant performance impact on the thread pool, but I'm skeptical of the numbers I'm seeing. Here are the new benchmark results with MSVC:

relative ms/op op/s err% total matrix multiplication 8x8
100.0% 73.49 13.61 0.8% 18.22 dp::thread_pool - std::function
105.5% 69.62 14.36 0.3% 16.94 dp::thread_pool - std::move_only_function
98.0% 74.99 13.34 1.1% 18.22 dp::thread_pool - fu2::unique_function
39.7% 185.12 5.40 0.4% 44.72 BS::thread_pool
50.7% 144.83 6.90 0.7% 34.89 task_thread_pool
40.2% 182.84 5.47 0.5% 44.23 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 64x64
100.0% 58.25 17.17 4.4% 14.69 dp::thread_pool - std::function
108.8% 53.55 18.68 0.7% 13.01 dp::thread_pool - std::move_only_function
101.3% 57.53 17.38 0.9% 13.94 dp::thread_pool - fu2::unique_function
40.3% 144.52 6.92 0.3% 34.92 BS::thread_pool
45.2% 128.80 7.76 0.2% 31.15 task_thread_pool
42.4% 137.50 7.27 0.3% 33.40 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 256x256
100.0% 47.36 21.11 2.6% 11.54 dp::thread_pool - std::function
106.5% 44.49 22.48 0.2% 10.74 dp::thread_pool - std::move_only_function
100.1% 47.31 21.14 0.2% 11.46 dp::thread_pool - fu2::unique_function
47.0% 100.80 9.92 0.3% 24.42 BS::thread_pool
51.1% 92.77 10.78 0.4% 22.41 task_thread_pool
46.3% 102.20 9.78 0.7% 24.99 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 512x512
100.0% 38.02 26.30 0.3% 9.20 dp::thread_pool - std::function
134.7% 28.23 35.42 2.3% 6.85 dp::thread_pool - std::move_only_function
97.3% 39.08 25.59 0.6% 9.50 dp::thread_pool - fu2::unique_function
49.5% 76.79 13.02 0.2% 18.57 BS::thread_pool
53.1% 71.65 13.96 0.2% 17.35 task_thread_pool
48.6% 78.29 12.77 0.4% 18.91 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 1024x1024
100.0% 42.87 23.33 1.9% 10.32 dp::thread_pool - std::function
99.9% 42.91 23.31 2.2% 10.30 dp::thread_pool - std::move_only_function
101.9% 42.08 23.76 1.4% 10.21 dp::thread_pool - fu2::unique_function
73.0% 58.70 17.03 0.6% 14.22 BS::thread_pool
77.5% 55.32 18.08 0.7% 13.45 task_thread_pool
99.5% 43.08 23.21 2.2% 10.47 riften::Thiefpool

@jtd-formlabs Do you have any input or insights on what could be causing such a large performance uplift? Previous benchmarks showed my library edging out some other popular libraries, but now it blows them out of the water. I'm not mad at that, but I'm wondering if there is something I'm missing here...
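
For reading the tables above: the values in the `relative` column are consistent with dividing the baseline row's `ms/op` (the first row of each group) by the row's own `ms/op`, so values below 100% mean slower than the baseline. The helper below is only a hypothetical illustration of that arithmetic, not part of the benchmark code.

```cpp
// Hypothetical helper: relative speed of a candidate versus a baseline,
// in percent, computed from time per operation (lower time is faster).
constexpr double relative_percent(double baseline_ms_per_op,
                                  double candidate_ms_per_op) {
    return 100.0 * baseline_ms_per_op / candidate_ms_per_op;
}
```

For the 8x8 group, `relative_percent(73.49, 185.12)` comes out at roughly 39.7, which matches the BS::thread_pool row. Small differences in the last digit are expected because the table's `ms/op` values are already rounded.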

@jtd-formlabs
Contributor Author

This is really interesting! I'll run your benchmarks on my Ubuntu system to get some additional metrics as well.

One theory is that the new scheduling system relies much less on work stealing: it always hands work to a thread that is ready for it first, so there is less delay.

It may be worth adding some instrumentation to the code both before and after this PR to count how many workloads are stolen during the benchmarks.
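
A counter along these lines could provide that measurement. This is only a sketch: `steal_stats` and its member names are hypothetical, and the calls would need to be placed wherever a worker pops from its own queue versus from another worker's queue.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical instrumentation; the names here are illustrative only.
struct steal_stats {
    std::atomic<std::size_t> own_tasks{0};
    std::atomic<std::size_t> stolen_tasks{0};

    // Call when a worker runs a task taken from its own queue.
    void record_own() { own_tasks.fetch_add(1, std::memory_order_relaxed); }

    // Call when a worker runs a task taken from another worker's queue.
    void record_stolen() { stolen_tasks.fetch_add(1, std::memory_order_relaxed); }

    // Fraction of executed tasks that were stolen; 0 if nothing ran yet.
    double stolen_fraction() const {
        const std::size_t own = own_tasks.load(std::memory_order_relaxed);
        const std::size_t stolen = stolen_tasks.load(std::memory_order_relaxed);
        const std::size_t total = own + stolen;
        return total == 0 ? 0.0
                          : static_cast<double>(stolen) / static_cast<double>(total);
    }
};
```

Relaxed atomics are enough here because the counters are only read after the benchmark finishes. Comparing `stolen_fraction()` on master and on this branch would show whether the new scheduling really reduces stealing.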

@DeveloperPaul123
Owner

Yes, I agree with your comments. This should result in the threads generally always having work to do and having to steal less (if at all).

I'm curious to see the numbers on Ubuntu as well, as I haven't tried running benchmarks there yet.

@jtd-formlabs
Contributor Author

jtd-formlabs commented Jul 3, 2023

So I ran two benchmarks, one with pyperf system and one without:

WITH:

relative ms/op op/s err% total matrix multiplication 8x8
100.0% 182.55 5.48 0.4% 32.67 dp::thread_pool - std::function
99.0% 184.43 5.42 0.3% 33.04 dp::thread_pool - std::move_only_function
98.9% 184.60 5.42 0.5% 33.13 dp::thread_pool - fu2::unique_function
74.9% 243.85 4.10 0.3% 44.00 BS::thread_pool
56.0% 326.01 3.07 0.5% 58.38 task_thread_pool
132.7% 137.60 7.27 1.6% 25.69 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 64x64
100.0% 159.72 6.26 0.6% 28.60 dp::thread_pool - std::function
96.7% 165.12 6.06 0.3% 29.62 dp::thread_pool - std::move_only_function
100.5% 158.87 6.29 0.4% 28.42 dp::thread_pool - fu2::unique_function
82.0% 194.82 5.13 0.4% 35.09 BS::thread_pool
63.3% 252.48 3.96 0.4% 45.24 task_thread_pool
105.8% 150.96 6.62 4.1% 27.27 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 256x256
100.0% 118.77 8.42 0.2% 21.26 dp::thread_pool - std::function
92.9% 127.82 7.82 0.7% 22.89 dp::thread_pool - std::move_only_function
99.0% 119.94 8.34 0.7% 21.49 dp::thread_pool - fu2::unique_function
81.2% 146.29 6.84 1.0% 26.19 BS::thread_pool
66.0% 180.02 5.55 0.1% 32.26 task_thread_pool
88.6% 134.01 7.46 1.5% 24.10 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 512x512
100.0% 99.88 10.01 1.3% 17.86 dp::thread_pool - std::function
99.9% 100.00 10.00 0.3% 17.93 dp::thread_pool - std::move_only_function
101.0% 98.93 10.11 0.3% 17.75 dp::thread_pool - fu2::unique_function
85.7% 116.59 8.58 0.7% 20.96 BS::thread_pool
76.1% 131.28 7.62 0.8% 23.60 task_thread_pool
100.0% 99.90 10.01 4.1% 18.02 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 1024x1024
100.0% 78.76 12.70 1.5% 14.20 dp::thread_pool - std::function
95.5% 82.48 12.12 1.0% 14.78 dp::thread_pool - std::move_only_function
100.1% 78.70 12.71 1.1% 14.08 dp::thread_pool - fu2::unique_function
78.6% 100.23 9.98 1.3% 18.05 BS::thread_pool
77.3% 101.88 9.82 0.5% 18.29 task_thread_pool
88.8% 88.66 11.28 3.3% 15.83 riften::Thiefpool

WITHOUT:

Warning, results might be unstable:
* CPU frequency scaling enabled: CPU 0 between 400.0 and 4,700.0 MHz
* CPU governor is 'powersave' but should be 'performance'
* Turbo is enabled, CPU frequency will fluctuate

Recommendations
* Use 'pyperf system tune' before benchmarking. See https://github.com/psf/pyperf
relative ms/op op/s err% total matrix multiplication 8x8
100.0% 114.40 8.74 2.0% 20.56 dp::thread_pool - std::function
92.6% 123.52 8.10 0.5% 22.06 dp::thread_pool - std::move_only_function
92.7% 123.39 8.10 0.7% 22.20 dp::thread_pool - fu2::unique_function
70.6% 162.04 6.17 0.3% 28.88 BS::thread_pool
53.2% 214.96 4.65 0.6% 38.59 task_thread_pool
114.4% 100.03 10.00 0.3% 18.20 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 64x64
100.0% 114.62 8.72 1.0% 20.52 dp::thread_pool - std::function
104.1% 110.14 9.08 0.8% 19.79 dp::thread_pool - std::move_only_function
106.4% 107.77 9.28 1.0% 19.31 dp::thread_pool - fu2::unique_function
86.4% 132.71 7.54 1.5% 23.65 BS::thread_pool
68.8% 166.60 6.00 0.7% 29.88 task_thread_pool
112.9% 101.49 9.85 1.6% 18.09 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 256x256
100.0% 82.04 12.19 0.6% 14.73 dp::thread_pool - std::function
97.6% 84.06 11.90 0.6% 14.99 dp::thread_pool - std::move_only_function
101.0% 81.26 12.31 2.1% 14.56 dp::thread_pool - fu2::unique_function
81.7% 100.41 9.96 0.6% 17.96 BS::thread_pool
68.2% 120.34 8.31 0.4% 21.49 task_thread_pool
86.0% 95.37 10.49 1.3% 17.16 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 512x512
100.0% 68.46 14.61 0.6% 12.26 dp::thread_pool - std::function
100.9% 67.83 14.74 1.0% 12.21 dp::thread_pool - std::move_only_function
100.1% 68.37 14.63 0.3% 12.26 dp::thread_pool - fu2::unique_function
84.3% 81.25 12.31 0.7% 14.59 BS::thread_pool
72.1% 94.99 10.53 0.6% 17.00 task_thread_pool
92.2% 74.27 13.46 1.6% 13.36 riften::Thiefpool
relative ms/op op/s err% total matrix multiplication 1024x1024
100.0% 59.33 16.86 0.6% 10.66 dp::thread_pool - std::function
95.8% 61.94 16.14 1.1% 11.14 dp::thread_pool - std::move_only_function
97.9% 60.61 16.50 1.0% 10.82 dp::thread_pool - fu2::unique_function
81.9% 72.44 13.80 1.0% 13.04 BS::thread_pool
71.4% 83.11 12.03 3.3% 15.51 task_thread_pool
92.2% 64.37 15.54 1.0% 11.47 riften::Thiefpool

This was on a laptop running Ubuntu 20.04 with the power cable plugged in: a 20-core Intel system with 32 GB of RAM.

@jtd-formlabs
Contributor Author

I also needed a new include to compile with gcc 12, so I pushed that change. Both benchmarks were run compiling in release mode.

@DeveloperPaul123
Owner

Hmm, very interesting results. I think your results are much more reasonable. I thought running benchmarks on Windows might be a problem, since there is no equivalent to pyperf system on Windows that I know of, but I didn't think it would make such a difference. Unfortunately, pyperf system doesn't work on WSL 2 either, so I'm not sure what else I can do.

Regardless, I like the direction of this PR and will merge, but I will be hesitant to publish any new benchmark numbers until I can get more stable results.

DeveloperPaul123 merged commit 65918a0 into DeveloperPaul123:master on Jul 3, 2023
4 checks passed
@jtd-formlabs
Contributor Author

Seems totally reasonable! Thank you for taking the time to look this over and merge it!
