WIP: CI check -- threadpool with fewer barriers #11

max-krasnyansky · 2024-08-13T02:35:48Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

The poll params (--poll, ...) now specify a "polling level", i.e. how aggressively we poll before waiting on the condition variable. poll=0 means no polling, 1 means poll for 128K rounds and then wait, 2 means 256K rounds, and so on. The default value of 50 (i.e. 50x128K rounds) seems like a decent default across modern platforms. We can tune this further as things evolve.
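The polling level described above can be sketched roughly as follows. This is a minimal illustration, not the ggml implementation: `struct work_signal`, `poll_rounds`, `wait_for_work`, and `post_work` are hypothetical names, and the only facts taken from the description are the 128K-rounds-per-level scaling and the fallback to a condition variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define POLL_ROUNDS_PER_LEVEL (128u * 1024u) /* "128K rounds" per polling level */

/* Hypothetical worker-side state; names are illustrative, not ggml's. */
struct work_signal {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_int      n_work; /* incremented by the producer for each new job */
};

/* Number of spin iterations for a given polling level (0 means no polling). */
static uint64_t poll_rounds(int poll_level) {
    assert(poll_level >= 0);
    return (uint64_t) poll_level * POLL_ROUNDS_PER_LEVEL;
}

/* Spin for poll_rounds(level) iterations, then sleep on the condition variable.
 * Returns true if work was observed while spinning, false if the slow path ran. */
static bool wait_for_work(struct work_signal * s, int last_seen, int poll_level) {
    const uint64_t rounds = poll_rounds(poll_level);
    for (uint64_t i = 0; i < rounds; i++) {
        if (atomic_load_explicit(&s->n_work, memory_order_relaxed) != last_seen) {
            atomic_thread_fence(memory_order_acquire);
            return true;
        }
    }
    /* Slow path: block until the counter moves past the value we last saw. */
    pthread_mutex_lock(&s->mutex);
    while (atomic_load(&s->n_work) == last_seen) {
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    pthread_mutex_unlock(&s->mutex);
    return false;
}

/* Producer side: publish new work and wake any sleeping workers. */
static void post_work(struct work_signal * s) {
    pthread_mutex_lock(&s->mutex);
    atomic_fetch_add(&s->n_work, 1);
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}
```

With the default level of 50, a worker in this sketch would spin for 50 x 131072 iterations before it takes the mutex and sleeps. With level 0 it goes straight to the condition variable.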

fmz and others added 17 commits August 9, 2024 08:07

Introduce ggml_compute_threadpool

c5d9f63

- OpenMP functional: Check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

Minor fixes

76d2461

fixed use after release bug

9bd4367

fixed a harmless race condition

dc33f83

Fix Android build issue

32048c7

fix more race conditions

81522b9

fix deadlock for cases where cgraph.n_nodes == 1

4512d1a

and fix --poll case

threadpool: use cpu_get_num_math to set the default number of threadpool threads

5f44e28

This way we avoid using E-Cores and Hyperthreaded siblings.

bench: create fresh threadpool for each test

152fc73

For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc).

atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier

3cfce8d

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
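As a rough illustration of polling with relaxed loads and no sched_yield(), here is a minimal spin-barrier sketch. It is not the ggml_barrier() code: `struct spin_barrier`, its fields, and `spin_barrier_wait` are hypothetical names, and the fence placement is one reasonable choice rather than the one the commit makes.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical barrier state; field names are illustrative. */
struct spin_barrier {
    atomic_int n_arrived; /* threads that reached the barrier in this round */
    atomic_int n_passed;  /* number of completed rounds (generation counter) */
};

static void spin_barrier_wait(struct spin_barrier * b, int n_threads) {
    assert(n_threads > 0);
    if (n_threads == 1) {
        return; /* nothing to synchronize with */
    }

    /* Remember the current generation before announcing arrival. */
    const int gen = atomic_load_explicit(&b->n_passed, memory_order_relaxed);

    if (atomic_fetch_add_explicit(&b->n_arrived, 1, memory_order_seq_cst) == n_threads - 1) {
        /* Last thread to arrive: reset the count, then release the others. */
        atomic_store_explicit(&b->n_arrived, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->n_passed, 1, memory_order_seq_cst);
        return;
    }

    /* Busy-wait with relaxed loads and no sched_yield(). */
    while (atomic_load_explicit(&b->n_passed, memory_order_relaxed) == gen) {
        /* spin */
    }

    /* One full fence after the spin orders later reads behind the release. */
    atomic_thread_fence(memory_order_seq_cst);
}
```

The relaxed load keeps the hot loop cheap. Ordering is paid for once, by the fence after the loop exits, instead of on every iteration.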

threadpool: make polling the default to match openmp behavior

26ff44f

All command line args now allow for setting poll to 0 (false).

threadpool: do not wakeup threads in already paused threadpool

5c564e5

fix potential race condition in check_for_work

2201529

threadpool: do not create two threadpools if their params are identical

20db9f4

threadpool: reduce pause/resume/wakeup overhead in common cases

b18719b

We now start a threadpool in the paused state only if we have two threadpools. The resume is now implicit (i.e. triggered by new work), which reduces locking and context-switch overhead.

threadpool: reduce the number of barriers required

160fc8d

New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for an extra barrier to clear the "new_work" flag and removes the special case for trivial graphs.
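The counter-based signalling can be sketched as below. This is an assumption-laden illustration, not the code from the commit: `struct pool_state`, `struct worker_state`, `kickoff_graph`, and `worker_has_new_work` are hypothetical names. The point is that each worker compares the shared counter with the value it last saw, so nothing has to be cleared and no barrier is needed for that step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: one counter, bumped once per graph. */
struct pool_state {
    atomic_int n_graph;
};

/* Hypothetical per-worker state: the counter value this worker last acted on. */
struct worker_state {
    int last_graph;
};

/* Main thread: announce that a new graph is ready to be computed. */
static void kickoff_graph(struct pool_state * p) {
    assert(p != NULL);
    atomic_fetch_add_explicit(&p->n_graph, 1, memory_order_release);
}

/* Worker: returns true exactly once per observed counter change.
 * There is no shared flag to reset, so no barrier is needed to clear it. */
static bool worker_has_new_work(struct pool_state * p, struct worker_state * w) {
    assert(p != NULL && w != NULL);
    const int cur = atomic_load_explicit(&p->n_graph, memory_order_acquire);
    if (cur != w->last_graph) {
        w->last_graph = cur;
        return true;
    }
    return false;
}
```

Because the "seen" state lives in each worker, a trivial single-node graph is signalled the same way as any other graph in this sketch.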

github-actions bot added testing examples server ggml labels Aug 13, 2024

threadpool: remove special-casing for disposable threadpools

fee969d

With the efficient hybrid polling there is no need to make disposable pools any different. This simplifies the overall logic and reduces branching.

max-krasnyansky closed this Aug 13, 2024

max-krasnyansky reopened this Aug 13, 2024

max-krasnyansky closed this Aug 14, 2024
