Threadpool: take 2 #8

fmz · 2024-07-17T15:56:46Z

Added an API to support explicit management and fine-grained control of threadpools.
The API supports creating separate threadpools for different parts of execution (e.g. batch, single-token). Each threadpool can be created, paused, resumed, and released independently of the others. This avoids the overhead of starting and stopping threads on every decode call, and it lets the OS accumulate scheduling history for long-lived threads so it can make better scheduling decisions.
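To make the create/pause/resume/release lifecycle concrete, here is a minimal sketch of a persistent pool written with the C++ standard library only. `SketchPool` and all of its members are hypothetical names for illustration; this is not the ggml threadpool API from this PR, and it assumes the caller resumes the pool before submitting work.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration only; NOT the ggml threadpool API.
class SketchPool {
public:
    explicit SketchPool(int n_threads) {
        for (int i = 0; i < n_threads; ++i) {
            workers_.emplace_back([this, i] { worker_loop(i); });
        }
    }
    ~SketchPool() { release(); }

    // Paused workers stay parked on the condition variable and pick up no new work.
    void pause() { std::lock_guard<std::mutex> lk(m_); paused_ = true; }
    void resume() {
        { std::lock_guard<std::mutex> lk(m_); paused_ = false; }
        cv_.notify_all();
    }

    // Run fn(thread_index) once on every worker and wait until all have finished.
    // Assumption: the pool is not paused when this is called.
    void run(const std::function<void(int)>& fn) {
        std::unique_lock<std::mutex> lk(m_);
        job_ = fn;
        pending_ = static_cast<int>(workers_.size());
        ++generation_;
        cv_.notify_all();
        done_cv_.wait(lk, [this] { return pending_ == 0; });
    }

    // Stop and join all workers; safe to call more than once.
    void release() {
        {
            std::lock_guard<std::mutex> lk(m_);
            if (stop_) return;
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
        workers_.clear();
    }

private:
    void worker_loop(int idx) {
        unsigned seen = 0;  // last job generation this worker executed
        for (;;) {
            std::function<void(int)> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return stop_ || (!paused_ && generation_ != seen); });
                if (stop_) return;
                seen = generation_;
                job = job_;
            }
            job(idx);
            std::lock_guard<std::mutex> lk(m_);
            if (--pending_ == 0) done_cv_.notify_all();
        }
    }

    std::mutex m_;
    std::condition_variable cv_, done_cv_;
    std::function<void(int)> job_;
    std::vector<std::thread> workers_;
    unsigned generation_ = 0;
    int pending_ = 0;
    bool paused_ = false;
    bool stop_ = false;
};

// Usage: the same four threads serve both calls; nothing is spawned per call.
int demo_pool() {
    std::atomic<int> counter{0};
    SketchPool pool(4);
    pool.run([&](int) { counter.fetch_add(1); });
    pool.pause();   // workers idle, but stay alive
    pool.resume();
    pool.run([&](int) { counter.fetch_add(1); });
    pool.release();
    return counter.load();
}
```

The point of the sketch is that thread creation happens once in the constructor, so repeated `run` calls reuse the same OS threads.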

Each threadpool supports:

- Setting the number of threads (duh)
- Setting a CPU mask for the threads to be placed on
- Strict or relaxed placement: pinning specific threads to specific cores, or letting the OS decide
- Polling or interrupt-driven wait
- Setting thread priority
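As a rough illustration of how a CPU mask and the strict/relaxed switch could translate into per-thread affinity, the sketch below computes one mask per thread. `plan_affinity` is a hypothetical helper, not a function from this PR; it assumes at most 64 CPUs and a non-negative thread count, and it only computes masks. Applying them would need an OS call such as `pthread_setaffinity_np` on Linux.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ggml): returns one affinity mask per thread.
//  - relaxed: every thread gets the whole mask and the OS picks the core
//  - strict:  thread t is pinned to a single core, taken round-robin from the mask
std::vector<std::uint64_t> plan_affinity(std::uint64_t cpumask, int n_threads, bool strict) {
    std::vector<std::uint64_t> out(static_cast<std::size_t>(n_threads), cpumask);
    if (!strict || cpumask == 0) {
        return out;  // relaxed placement, or nothing to pin to
    }
    // Collect the indices of the cores enabled in the mask, lowest first.
    std::vector<int> cores;
    for (int c = 0; c < 64; ++c) {
        if (cpumask & (std::uint64_t{1} << c)) cores.push_back(c);
    }
    for (int t = 0; t < n_threads; ++t) {
        out[static_cast<std::size_t>(t)] =
            std::uint64_t{1} << cores[static_cast<std::size_t>(t) % cores.size()];
    }
    return out;
}
```

With mask `0xA` (cores 1 and 3) and three threads, strict placement assigns cores 1, 3, 1, while relaxed placement leaves all three threads free to run on either core.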

Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior).
If users choose to use threadpools explicitly, they have to manage them manually. See the examples in main.cpp and speculative.cpp.

With all the bells and whistles enabled, we generally see a 0.25-1 tok/s improvement across the board.
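Of the options listed above, polling versus interrupt-driven wait is the least self-explanatory, so here is a small sketch of one common shape: spin on an atomic for a bounded number of rounds, then fall back to blocking on a condition variable. `WorkSignal`, `post`, `wait`, and the `poll_rounds` parameter are hypothetical names; the PR's actual wait loop may differ.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical sketch (not ggml code): a generation counter that workers wait on.
struct WorkSignal {
    std::atomic<int> generation{0};
    std::mutex m;
    std::condition_variable cv;

    // Publish new work. The increment happens under the mutex so that a waiter
    // that is about to block cannot miss the notification.
    void post() {
        {
            std::lock_guard<std::mutex> lk(m);
            generation.fetch_add(1, std::memory_order_release);
        }
        cv.notify_all();
    }

    // Wait until the generation differs from last_seen and return the new value.
    // poll_rounds > 0: poll first (low wake-up latency, burns CPU while spinning).
    // poll_rounds == 0: purely interrupt-driven (sleeps until notified).
    int wait(int last_seen, int poll_rounds) {
        for (int i = 0; i < poll_rounds; ++i) {
            int g = generation.load(std::memory_order_acquire);
            if (g != last_seen) return g;
            std::this_thread::yield();
        }
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return generation.load(std::memory_order_acquire) != last_seen; });
        return generation.load(std::memory_order_acquire);
    }
};

// Usage: one waiter, one post; works the same with or without polling.
int demo_wait(int poll_rounds) {
    WorkSignal sig;
    int got = -1;
    std::thread waiter([&] { got = sig.wait(0, poll_rounds); });
    sig.post();
    waiter.join();
    return got;
}
```

The trade-off is the usual one: polling reacts faster when work arrives in quick succession, while the blocking path frees the core when the pool is idle.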

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

fmz · 2024-07-17T15:59:14Z

CMakePresets.json

-            "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",
-            "CMAKE_INSTALL_RPATH": "$ORIGIN;$ORIGIN/.."
+    "version": 4,
+    "configurePresets": [


Revert all of this

fmz · 2024-07-19T17:19:23Z

On W-2225 Xeon machine: CPU backend:

| CPU | Model | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | pp512 | 17.57 | 17.56 | 1.00 |
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | tg128 | 6.93 | 7.14 | 1.03 |

$ ./build/bin/llama-bench -t 8

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp512 | 18.38 ± 0.46 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 8.37 ± 0.01 |

fmz · 2024-07-22T17:35:25Z

./scripts/compare-commits.sh master threadpool-attempt-2 -t 1,2,4,6,8,10

| CPU | Model | Threads | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | pp512 | 3.93 | 3.94 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | tg128 | 2.43 | 2.44 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | pp512 | 7.13 | 7.06 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | tg128 | 4.37 | 4.36 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | pp512 | 11.96 | 11.99 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | tg128 | 6.79 | 6.77 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | pp512 | 14.96 | 14.98 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | tg128 | 7.51 | 7.53 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | pp512 | 13.06 | 13.09 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | tg128 | 6.88 | 6.83 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | pp512 | 14.08 | 14.06 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | tg128 | 7.49 | 7.52 | 1.00 |

fmz · 2024-07-22T22:01:27Z

$ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool-attempt-2 -nkvo 0,1
(No discernible difference between OpenMP and ggml_threadpool with default settings)

| GPU | Model | NKVO | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
| --- | --- | --- | --- | ---: | ---: | ---: |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1644.73 | 1642.34 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 65.94 | 65.89 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 287.28 | 286.44 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 54.56 | 54.32 | 1.00 |

- OpenMP functional: Check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

... facing segfaults on master ...

…gml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads, as can be seen in the ASan trace below. This fixes the initialization to make it thread-safe.

#0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
...

==45482==Register values:
x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738
x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001
x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001
x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)

Thread T5 created by T0 here:
#0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
...
==45482==ABORTING
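The commit message above says the initialization was made thread-safe but does not show how. One standard way to get that property in C++11 and later is a function-local static, whose initialization the language guarantees to run exactly once even when several threads make the first call concurrently. The sketch below illustrates that technique only; `reserved_names`, `is_reserved`, and the set contents are hypothetical and are not taken from json-schema-to-grammar.cpp.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a reserved-name table; the contents are illustrative.
static const std::unordered_set<std::string>& reserved_names() {
    // Function-local static: initialized once, thread-safely, on first use.
    // After initialization the set is const, so concurrent reads are safe.
    static const std::unordered_set<std::string> names = {"root", "boolean", "number"};
    return names;
}

bool is_reserved(const std::string& name) {
    return reserved_names().count(name) != 0;
}

// Usage: eight threads race on the first call; each one sees a fully built set.
int demo_concurrent_lookups() {
    std::vector<int> hits(8, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i) {
        threads.emplace_back([&hits, i] { hits[i] = is_reserved("root") ? 1 : 0; });
    }
    for (auto& t : threads) t.join();
    int total = 0;
    for (int h : hits) total += h;
    return total;
}
```

By contrast, a pattern that checks "is the container empty?" and then fills it from whichever thread arrives first has no such guarantee, which is consistent with the corrupted hash table in the trace.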

fmz marked this pull request as draft July 17, 2024 15:56

github-actions bot added testing examples server ggml build labels Jul 17, 2024

fmz commented Jul 17, 2024

fmz force-pushed the threadpool-attempt-2 branch 3 times, most recently from d50e63d to 35b447a Compare July 17, 2024 19:38

fmz changed the title ~~threadpool <WIP>~~ Threadpool: take 2 Jul 17, 2024

fmz force-pushed the threadpool-attempt-2 branch from 35b447a to e357251 Compare July 18, 2024 15:11

fmz marked this pull request as ready for review July 18, 2024 15:12

fmz requested a review from max-krasnyansky July 19, 2024 14:53

fmz force-pushed the threadpool-attempt-2 branch 2 times, most recently from 4a4e9f4 to 29338e9 Compare July 22, 2024 14:42

fmz force-pushed the threadpool-attempt-2 branch from 24b6ce4 to bbc47cf Compare July 23, 2024 20:15

fmz added 4 commits July 25, 2024 15:17

Introduce ggml_compute_threadpool

9328133

uncomment cpu-relax

a4e97f3

re-enable speculative

bc7eaec

add _GNU_SOURCE

e317ab6

fmz force-pushed the threadpool-attempt-2 branch from 2605081 to e317ab6 Compare July 25, 2024 19:19

max-krasnyansky closed this Aug 11, 2024
