Proposed fix for #40 #41

jtd-formlabs · 2023-06-09T18:29:23Z

This PR extends thread_safe_queue so that it can also serve as a priority queue of threads for work assignment.

At the end of each task it executes, a thread moves itself to the top of the priority list, so new work goes to threads that are ready for it first.
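The scheduling idea above can be sketched as a small mutex-protected list of worker indices. This is a minimal, hypothetical C++ illustration, not the code in this PR: `worker_priority_queue`, `move_to_front`, and `pop_front_and_rotate` are made-up names, and the sketch assumes each worker is identified by an index and that a submitted task is routed to whichever worker is at the front of the list.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical sketch: worker indices ordered by readiness. The front holds
// the worker that most recently became free.
class worker_priority_queue {
public:
    // Add a worker at the back (lowest priority), e.g. when the pool starts.
    void push_back(std::size_t worker_id) {
        std::scoped_lock lock(mutex_);
        ids_.push_back(worker_id);
    }

    // Called by a worker after it finishes a task: move it to the front so the
    // next submitted task is routed to a thread that is ready for work.
    void move_to_front(std::size_t worker_id) {
        std::scoped_lock lock(mutex_);
        auto it = std::find(ids_.begin(), ids_.end(), worker_id);
        if (it != ids_.end()) {
            ids_.erase(it);
        }
        ids_.push_front(worker_id);
    }

    // Called on task submission: pick the highest-priority worker and rotate it
    // to the back, since it is about to become busy. The worker stays in the
    // list, so the list never runs empty. Returns nullopt if no workers exist.
    std::optional<std::size_t> pop_front_and_rotate() {
        std::scoped_lock lock(mutex_);
        if (ids_.empty()) {
            return std::nullopt;
        }
        const std::size_t id = ids_.front();
        ids_.pop_front();
        ids_.push_back(id);
        return id;
    }

private:
    std::mutex mutex_;
    std::deque<std::size_t> ids_;
};
```

With four workers, two submissions go to workers 0 and 1 (order becomes 2 3 0 1); if worker 0 then finishes and calls `move_to_front(0)`, the order becomes 0 2 3 1 and the next submission goes back to worker 0 instead of to a thread that may still be busy. The linear `std::find` is a deliberate simplification that is reasonable for pool-sized thread counts.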

jtd-formlabs · 2023-06-14T14:25:24Z

I have addressed the issues found by the workflows, and added a new unit test that I confirmed (on my machine) fails on master but passes on this branch.
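The test itself is not shown in this thread. Because the bug in #40 shows up as a hang rather than as a wrong result, a test for it needs some kind of timeout guard so that a regression fails instead of blocking CI forever. A minimal sketch of such a guard, assuming nothing about the pool's API (`finishes_within` is a hypothetical helper, not part of the PR or of dp::thread_pool):

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical test helper: run `work` on a separate thread and report whether
// it finished (normally or by throwing) within `timeout`. The thread is
// detached so that a genuine deadlock cannot also block the test itself; a
// future obtained from std::async would block in its destructor instead.
bool finishes_within(std::function<void()> work, std::chrono::milliseconds timeout) {
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> result = done->get_future();
    std::thread([work = std::move(work), done] {
        try {
            work();
            done->set_value();
        } catch (...) {
            // Capture the exception instead of letting it terminate the process.
            done->set_exception(std::current_exception());
        }
    }).detach();
    return result.wait_for(timeout) == std::future_status::ready;
}
```

In a regression test for #40, `work` would reproduce the scenario from the issue (a job scheduled onto a thread that is still running while all other threads have finished) and wait for that job's result; the exact pool calls depend on the dp::thread_pool API and are not shown here.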

DeveloperPaul123 · 2023-06-14T19:54:34Z

Thank you for this! Once I have some time, I can look into this more thoroughly and run it on my machine. I don't see any issues with merging it once I've done that. Thanks again!

DeveloperPaul123 · 2023-07-03T16:42:40Z

Overall this looks really good. It seems that this also significantly improves the thread pool's performance, but I'm skeptical of the numbers I'm seeing. Here are the new benchmark results with MSVC:

relative	ms/op	op/s	err%	total	matrix multiplication 8x8
100.0%	73.49	13.61	0.8%	18.22	`dp::thread_pool - std::function`
105.5%	69.62	14.36	0.3%	16.94	`dp::thread_pool - std::move_only_function`
98.0%	74.99	13.34	1.1%	18.22	`dp::thread_pool - fu2::unique_function`
39.7%	185.12	5.40	0.4%	44.72	`BS::thread_pool`
50.7%	144.83	6.90	0.7%	34.89	`task_thread_pool`
40.2%	182.84	5.47	0.5%	44.23	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 64x64
100.0%	58.25	17.17	4.4%	14.69	`dp::thread_pool - std::function`
108.8%	53.55	18.68	0.7%	13.01	`dp::thread_pool - std::move_only_function`
101.3%	57.53	17.38	0.9%	13.94	`dp::thread_pool - fu2::unique_function`
40.3%	144.52	6.92	0.3%	34.92	`BS::thread_pool`
45.2%	128.80	7.76	0.2%	31.15	`task_thread_pool`
42.4%	137.50	7.27	0.3%	33.40	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 256x256
100.0%	47.36	21.11	2.6%	11.54	`dp::thread_pool - std::function`
106.5%	44.49	22.48	0.2%	10.74	`dp::thread_pool - std::move_only_function`
100.1%	47.31	21.14	0.2%	11.46	`dp::thread_pool - fu2::unique_function`
47.0%	100.80	9.92	0.3%	24.42	`BS::thread_pool`
51.1%	92.77	10.78	0.4%	22.41	`task_thread_pool`
46.3%	102.20	9.78	0.7%	24.99	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 512x512
100.0%	38.02	26.30	0.3%	9.20	`dp::thread_pool - std::function`
134.7%	28.23	35.42	2.3%	6.85	`dp::thread_pool - std::move_only_function`
97.3%	39.08	25.59	0.6%	9.50	`dp::thread_pool - fu2::unique_function`
49.5%	76.79	13.02	0.2%	18.57	`BS::thread_pool`
53.1%	71.65	13.96	0.2%	17.35	`task_thread_pool`
48.6%	78.29	12.77	0.4%	18.91	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 1024x1024
100.0%	42.87	23.33	1.9%	10.32	`dp::thread_pool - std::function`
99.9%	42.91	23.31	2.2%	10.30	`dp::thread_pool - std::move_only_function`
101.9%	42.08	23.76	1.4%	10.21	`dp::thread_pool - fu2::unique_function`
73.0%	58.70	17.03	0.6%	14.22	`BS::thread_pool`
77.5%	55.32	18.08	0.7%	13.45	`task_thread_pool`
99.5%	43.08	23.21	2.2%	10.47	`riften::Thiefpool`
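For reading these tables: the relative column appears to measure each row's speed against the first row (`dp::thread_pool - std::function`), i.e. the baseline's ms/op divided by the row's ms/op, so values above 100% are faster than the baseline and values below are slower. The table layout suggests nanobench output, though that is an assumption; the numbers above are consistent with:

```latex
\text{relative}_i = \frac{t_{\text{baseline}}}{t_i} \times 100\%, \qquad t = \text{ms/op}

\text{8x8, BS::thread\_pool:} \quad \frac{73.49}{185.12} \approx 0.397 \;\Rightarrow\; 39.7\%

\text{512x512, std::move\_only\_function:} \quad \frac{38.02}{28.23} \approx 1.347 \;\Rightarrow\; 134.7\%
```

Equivalently, it is the ratio of op/s values (5.40 / 13.61 ≈ 0.397 for the first example).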

@jtd-formlabs Do you have any input or insights on what could be causing such a large performance uplift? Previous benchmarks showed my library edging out some other popular libraries, but now it blows them out of the water. I'm not mad at that, but I'm wondering if there is something I'm missing here...

jtd-formlabs · 2023-07-03T16:50:02Z

This is really interesting! I'll run your benchmarks on my Ubuntu system to get some additional metrics as well.

One theory is that the new scheduling system relies much less on work stealing: it always hands work to a thread that is ready for it first, so there is less delay.

It may be worth instrumenting the code before and after this PR to count how many tasks are stolen during the benchmarks.
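One way to do that instrumentation, sketched under the assumption that the pool's worker loop can tell whether a task came from its own queue or was stolen from another worker's queue. `task_stats` and its hooks are hypothetical names, not part of dp::thread_pool:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical counters a worker loop could bump after running each task.
struct task_stats {
    std::atomic<std::uint64_t> executed_local{0};
    std::atomic<std::uint64_t> executed_stolen{0};

    // Call after running a task taken from the worker's own queue.
    void on_local_task() { executed_local.fetch_add(1, std::memory_order_relaxed); }

    // Call after running a task stolen from another worker's queue.
    void on_stolen_task() { executed_stolen.fetch_add(1, std::memory_order_relaxed); }

    // Fraction of executed tasks that had to be stolen (0 if nothing ran yet).
    double steal_ratio() const {
        const std::uint64_t local = executed_local.load(std::memory_order_relaxed);
        const std::uint64_t stolen = executed_stolen.load(std::memory_order_relaxed);
        const std::uint64_t total = local + stolen;
        return total == 0 ? 0.0 : static_cast<double>(stolen) / static_cast<double>(total);
    }
};
```

Relaxed increments are enough for pure counters. Reading them after the worker threads have been joined gives exact totals, because `std::thread::join` synchronizes with the completion of each thread. Comparing `steal_ratio()` on master and on this branch for the same benchmark would test the theory directly.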

DeveloperPaul123 · 2023-07-03T17:03:56Z

Yes, I agree with your comments. In general this should keep the threads supplied with work, so they have to steal less (if at all).

I'm curious to see the numbers on Ubuntu as well as I haven't tried running benchmarks there yet.

jtd-formlabs · 2023-07-03T18:38:12Z

So I ran the benchmarks twice, once with the system tuned using `pyperf system tune` and once without:

WITH:

relative	ms/op	op/s	err%	total	matrix multiplication 8x8
100.0%	182.55	5.48	0.4%	32.67	`dp::thread_pool - std::function`
99.0%	184.43	5.42	0.3%	33.04	`dp::thread_pool - std::move_only_function`
98.9%	184.60	5.42	0.5%	33.13	`dp::thread_pool - fu2::unique_function`
74.9%	243.85	4.10	0.3%	44.00	`BS::thread_pool`
56.0%	326.01	3.07	0.5%	58.38	`task_thread_pool`
132.7%	137.60	7.27	1.6%	25.69	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 64x64
100.0%	159.72	6.26	0.6%	28.60	`dp::thread_pool - std::function`
96.7%	165.12	6.06	0.3%	29.62	`dp::thread_pool - std::move_only_function`
100.5%	158.87	6.29	0.4%	28.42	`dp::thread_pool - fu2::unique_function`
82.0%	194.82	5.13	0.4%	35.09	`BS::thread_pool`
63.3%	252.48	3.96	0.4%	45.24	`task_thread_pool`
105.8%	150.96	6.62	4.1%	27.27	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 256x256
100.0%	118.77	8.42	0.2%	21.26	`dp::thread_pool - std::function`
92.9%	127.82	7.82	0.7%	22.89	`dp::thread_pool - std::move_only_function`
99.0%	119.94	8.34	0.7%	21.49	`dp::thread_pool - fu2::unique_function`
81.2%	146.29	6.84	1.0%	26.19	`BS::thread_pool`
66.0%	180.02	5.55	0.1%	32.26	`task_thread_pool`
88.6%	134.01	7.46	1.5%	24.10	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 512x512
100.0%	99.88	10.01	1.3%	17.86	`dp::thread_pool - std::function`
99.9%	100.00	10.00	0.3%	17.93	`dp::thread_pool - std::move_only_function`
101.0%	98.93	10.11	0.3%	17.75	`dp::thread_pool - fu2::unique_function`
85.7%	116.59	8.58	0.7%	20.96	`BS::thread_pool`
76.1%	131.28	7.62	0.8%	23.60	`task_thread_pool`
100.0%	99.90	10.01	4.1%	18.02	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 1024x1024
100.0%	78.76	12.70	1.5%	14.20	`dp::thread_pool - std::function`
95.5%	82.48	12.12	1.0%	14.78	`dp::thread_pool - std::move_only_function`
100.1%	78.70	12.71	1.1%	14.08	`dp::thread_pool - fu2::unique_function`
78.6%	100.23	9.98	1.3%	18.05	`BS::thread_pool`
77.3%	101.88	9.82	0.5%	18.29	`task_thread_pool`
88.8%	88.66	11.28	3.3%	15.83	`riften::Thiefpool`

WITHOUT:

Warning, results might be unstable:
* CPU frequency scaling enabled: CPU 0 between 400.0 and 4,700.0 MHz
* CPU governor is 'powersave' but should be 'performance'
* Turbo is enabled, CPU frequency will fluctuate

Recommendations
* Use 'pyperf system tune' before benchmarking. See https://github.com/psf/pyperf

relative	ms/op	op/s	err%	total	matrix multiplication 8x8
100.0%	114.40	8.74	2.0%	20.56	`dp::thread_pool - std::function`
92.6%	123.52	8.10	0.5%	22.06	`dp::thread_pool - std::move_only_function`
92.7%	123.39	8.10	0.7%	22.20	`dp::thread_pool - fu2::unique_function`
70.6%	162.04	6.17	0.3%	28.88	`BS::thread_pool`
53.2%	214.96	4.65	0.6%	38.59	`task_thread_pool`
114.4%	100.03	10.00	0.3%	18.20	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 64x64
100.0%	114.62	8.72	1.0%	20.52	`dp::thread_pool - std::function`
104.1%	110.14	9.08	0.8%	19.79	`dp::thread_pool - std::move_only_function`
106.4%	107.77	9.28	1.0%	19.31	`dp::thread_pool - fu2::unique_function`
86.4%	132.71	7.54	1.5%	23.65	`BS::thread_pool`
68.8%	166.60	6.00	0.7%	29.88	`task_thread_pool`
112.9%	101.49	9.85	1.6%	18.09	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 256x256
100.0%	82.04	12.19	0.6%	14.73	`dp::thread_pool - std::function`
97.6%	84.06	11.90	0.6%	14.99	`dp::thread_pool - std::move_only_function`
101.0%	81.26	12.31	2.1%	14.56	`dp::thread_pool - fu2::unique_function`
81.7%	100.41	9.96	0.6%	17.96	`BS::thread_pool`
68.2%	120.34	8.31	0.4%	21.49	`task_thread_pool`
86.0%	95.37	10.49	1.3%	17.16	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 512x512
100.0%	68.46	14.61	0.6%	12.26	`dp::thread_pool - std::function`
100.9%	67.83	14.74	1.0%	12.21	`dp::thread_pool - std::move_only_function`
100.1%	68.37	14.63	0.3%	12.26	`dp::thread_pool - fu2::unique_function`
84.3%	81.25	12.31	0.7%	14.59	`BS::thread_pool`
72.1%	94.99	10.53	0.6%	17.00	`task_thread_pool`
92.2%	74.27	13.46	1.6%	13.36	`riften::Thiefpool`

relative	ms/op	op/s	err%	total	matrix multiplication 1024x1024
100.0%	59.33	16.86	0.6%	10.66	`dp::thread_pool - std::function`
95.8%	61.94	16.14	1.1%	11.14	`dp::thread_pool - std::move_only_function`
97.9%	60.61	16.50	1.0%	10.82	`dp::thread_pool - fu2::unique_function`
81.9%	72.44	13.80	1.0%	13.04	`BS::thread_pool`
71.4%	83.11	12.03	3.3%	15.51	`task_thread_pool`
92.2%	64.37	15.54	1.0%	11.47	`riften::Thiefpool`

This was on a laptop running Ubuntu 20.04 with the power cable plugged in: a 20-core Intel system with 32 GB of RAM.

jtd-formlabs · 2023-07-03T18:41:44Z

I also needed a new include to compile with GCC 12, so I pushed that change. Both benchmarks were compiled in release mode.

DeveloperPaul123 · 2023-07-03T18:51:53Z

Hmm, very interesting results. I think your results are much more reasonable. I suspected that running benchmarks on Windows might be a problem, since I don't know of an equivalent to pyperf system on Windows, but I didn't think it would make such a difference. Unfortunately, pyperf system doesn't work on WSL 2 either, so I'm not sure what else I can do.

Regardless, I like the direction of this PR and will merge, but I will be hesitant to publish any new benchmark numbers until I can get more stable results.

jtd-formlabs · 2023-07-03T18:57:44Z

Seems totally reasonable! Thank you for taking the time to look this over and merge it!

jtd-formlabs added 5 commits June 9, 2023 18:28

mitigating bug by making a priority queue

06c5127

fixing bug where priority queue was emptied.

2565b27

removing unnecessary include.

31da698

unit tests now pass in my environment (ubuntu container)

45cff84

added unit test to catch the specific case that I found

c0e262c

DeveloperPaul123 self-requested a review June 14, 2023 19:53

adding #include statement to get gcc-12 to compile

9e32fc1

DeveloperPaul123 approved these changes Jul 3, 2023


DeveloperPaul123 merged commit 65918a0 into DeveloperPaul123:master Jul 3, 2023
4 checks passed

DeveloperPaul123 mentioned this pull request Jul 3, 2023

Thread pool hangs indefinitely if job scheduled on running thread and all other threads have finished execution #40 (closed)
