
Adding benchmark for thread-local object pool use case #374

Merged
merged 3 commits into JCTools:master on Mar 2, 2023

Conversation

franz1981
Collaborator

@franz1981 commented Feb 25, 2023

This is a naive attempt to emulate a silly (but soon to be quite frequent, believe me) use case: object pooling without letting the acquired object escape the consumer thread.

This pattern is very frequent in HFT, but not only there: as mentioned before, with Virtual Threads and Loom we need alternatives to cover for the lack of per-carrier-thread locals (see https://github.com/FasterXML/jackson-core/blob/390472bbdf4722fe058f48bb0eff5865c8d20f73/src/main/java/com/fasterxml/jackson/core/util/BufferRecyclers.java#L36).
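To make the Loom angle concrete, here is a minimal sketch (my own naming, in the spirit of the linked BufferRecyclers, not its actual code) of the ThreadLocal-based recycling that stops paying off: a long-lived platform thread reuses its cached buffer across many requests, while a typically short-lived virtual thread gets its own copy that is rarely reused, so we are back to plain allocation - hence the interest in a shared, queue-backed pool.

```java
// Illustrative sketch only (names are mine): ThreadLocal-based buffer recycling
// in the spirit of the linked BufferRecyclers. Each thread caches its own buffer,
// which works well for long-lived platform threads but not for virtual threads,
// where every task tends to run on a fresh virtual thread and the cached value
// is rarely reused.
final class ThreadLocalRecycler {

    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[8 * 1024]);

    static byte[] acquire() {
        // On a virtual thread this is effectively a fresh allocation per task,
        // since there is no supported way to key the cache on the carrier thread.
        return BUFFER.get();
    }
}
```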

The recent answer from Heinz K. at https://mail.openjdk.org/pipermail/loom-dev/2023-February/005320.html
made me wonder: we suggest that users use our MPMC queues as object pools, but are they the right/decent tools for the job?
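For reference, the acquire/release pattern the benchmark exercises is roughly the following (a sketch with my own class and method names, not the benchmark code itself): poll a pooled object from a shared MPMC queue, fall back to allocation on a miss, use it without letting it escape the calling thread, then offer it back.

```java
import org.jctools.queues.MpmcArrayQueue;

import java.util.Queue;
import java.util.function.Supplier;

// Sketch of the "queue as object pool" pattern (names are illustrative, not the
// benchmark's actual code): poll from a shared MPMC queue, allocate on a miss,
// and offer the object back once the caller is done with it.
final class QueuePool<T> {

    private final Queue<T> pool;
    private final Supplier<T> factory;

    QueuePool(int capacity, Supplier<T> factory) {
        this.pool = new MpmcArrayQueue<>(capacity);
        this.factory = factory;
    }

    T acquire() {
        final T pooled = pool.poll();
        return pooled != null ? pooled : factory.get();
    }

    void release(T object) {
        // If the pool is full the object is simply dropped and left to the GC.
        pool.offer(object);
    }
}
```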

With 3 threads:

Benchmark                               (burstSize)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt     Score     Error  Units
QueueAsPoolBurstCost.acquireAndRelease            1       132000                         None      true  avgt   10    15.755 ±   0.309  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000           ArrayBlockingQueue      true  avgt   10   304.429 ±  14.945  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000        ConcurrentLinkedQueue      true  avgt   10   651.487 ±  27.480  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   10   404.714 ±   5.914  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000               MpmcArrayQueue      true  avgt   10   451.024 ±   7.468  ns/op
QueueAsPoolBurstCost.acquireAndRelease           10       132000                         None      true  avgt   10    82.535 ±   0.862  ns/op
QueueAsPoolBurstCost.acquireAndRelease           10       132000           ArrayBlockingQueue      true  avgt   10  2884.411 ± 417.531  ns/op
QueueAsPoolBurstCost.acquireAndRelease           10       132000        ConcurrentLinkedQueue      true  avgt   10  5856.916 ± 439.367  ns/op
QueueAsPoolBurstCost.acquireAndRelease           10       132000  MpmcUnboundedXaddArrayQueue      true  avgt   10  4304.779 ± 597.593  ns/op
QueueAsPoolBurstCost.acquireAndRelease           10       132000               MpmcArrayQueue      true  avgt   10  4395.881 ± 452.933  ns/op

I have introduced a fake CPU-consuming method (likely in the wrong place, given that it doesn't emulate what real users will do - meaning we should likely add some work before the acquire as well), and the numbers, as usual, look slightly different:
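The "fake CPU consume" step could look something like JMH's Blackhole.consumeCPU(tokens), with the (work) column being the number of tokens burned while the pooled object is held; the sketch below (illustrative names, not the actual benchmark method) shows where that work sits between acquire and release.

```java
import org.openjdk.jmh.infra.Blackhole;

import java.util.Queue;

// Illustrative shape of the measured operation (not the benchmark's actual code):
// acquire from the queue-backed pool, burn a calibrated amount of CPU via JMH's
// Blackhole.consumeCPU, then release the object back to the pool.
final class AcquireWorkRelease {

    static void run(Queue<Object> pool, long work) {
        Object pooled = pool.poll();
        if (pooled == null) {
            pooled = new Object(); // pool miss: allocate
        }
        Blackhole.consumeCPU(work); // simulated per-operation work, confined to this thread
        pool.offer(pooled);         // release back to the pool
    }
}
```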

Benchmark                               (burstSize)  (qCapacity)                      (qType)  (warmup)  (work)  Mode  Cnt     Score    Error  Units
QueueAsPoolBurstCost.acquireAndRelease            1       132000                         None      true       0  avgt   10    16.139 ±  0.182  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000                         None      true      10  avgt   10    30.127 ±  0.382  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000                         None      true     100  avgt   10   275.163 ±  1.003  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000           ArrayBlockingQueue      true       0  avgt   10   317.676 ± 11.457  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000           ArrayBlockingQueue      true      10  avgt   10   367.082 ± 16.133  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000           ArrayBlockingQueue      true     100  avgt   10  1180.306 ± 40.263  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000        ConcurrentLinkedQueue      true       0  avgt   10   664.938 ± 19.703  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000        ConcurrentLinkedQueue      true      10  avgt   10   620.152 ± 26.502  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000        ConcurrentLinkedQueue      true     100  avgt   10   653.506 ± 18.155  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000  MpmcUnboundedXaddArrayQueue      true       0  avgt   10   392.326 ±  9.412  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000  MpmcUnboundedXaddArrayQueue      true      10  avgt   10   397.462 ±  4.952  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000  MpmcUnboundedXaddArrayQueue      true     100  avgt   10   461.849 ± 23.687  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000               MpmcArrayQueue      true       0  avgt   10   455.009 ± 10.099  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000               MpmcArrayQueue      true      10  avgt   10   456.501 ± 14.344  ns/op
QueueAsPoolBurstCost.acquireAndRelease            1       132000               MpmcArrayQueue      true     100  avgt   10   483.445 ±  4.849  ns/op

Where "high" work now make the ArrayBlockingQueue a way less appealing solution (as expected? by me at least), but how much "high" translate into real-world "work" is not clear yet and makes me guess if our queues are good performer for this use case in realistic cases.

I'm aware that under very high contention lock-free queues aren't the best performers, but I wasn't expecting that to happen with just 3 threads, which means either I'm doing something wrong or I'm not considering other factors.

@nitsanw nitsanw merged commit 25c1a28 into JCTools:master Mar 2, 2023
@franz1981
Collaborator Author

@nitsanw you didn't give me your opinion!! :P:P
