
Performance significantly worse than libp2p mplex #42

Closed
marcus-pousette opened this issue Apr 26, 2023 · 10 comments

Comments
@marcus-pousette commented Apr 26, 2023

I am considering replacing all my dependencies on @libp2p/mplex with this yamux implementation in my project.

However, I am seeing a significant performance loss with this change. In my benchmark, where I do replication work between multiple nodes, I am able to do around 1200 tps with @libp2p/mplex but only around 700 with yamux.

I have not dug deep into the details of mplex and yamux, but is it technically possible to reach the same performance in a yamux implementation as with mplex? Or is the performance loss the cost of the features yamux brings to the table?

@marcus-pousette (Author) commented Apr 26, 2023

Some obvious things I see right away:

1).

const frame = new Uint8Array(HEADER_LENGTH)

This should instead be something like the snippet below. (Edit: it turns out the version below is not actually faster.)

// At the top of the file
const allocUnsafeFn = (): ((len: number) => Uint8Array) => {
  if (globalThis.Buffer != null) {
    return globalThis.Buffer.allocUnsafe
  }
  return (len) => new Uint8Array(len)
}
const allocUnsafe = allocUnsafeFn()

// Later
const frame = allocUnsafe(HEADER_LENGTH) 
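
(For context: Buffer.allocUnsafe skips the zero-fill that new Uint8Array(len) performs, which is why it can win on hot paths; for a fixed-size header allocation that difference is apparently lost in the noise.)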

2).

In the decoder, are we using the data length available in the header to allocate the right size for the Uint8ArrayList? It currently seems to be used only as a break condition.
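
To make the question concrete, here is a minimal sketch of what an exact-size read driven by the header's length field could look like. Uint8ArrayList is the real package, but FrameHeader and tryReadPayload are stand-ins for the decoder's own types, not the library's actual code:

import { Uint8ArrayList } from 'uint8arraylist'

// stand-in for the decoder's header type; only the length field matters here
interface FrameHeader {
  length: number
}

function tryReadPayload (buffer: Uint8ArrayList, header: FrameHeader): Uint8Array | undefined {
  if (buffer.byteLength < header.length) {
    // not enough data buffered yet - wait for the next chunk
    return undefined
  }
  // copy exactly header.length bytes out in a single right-sized allocation
  const payload = buffer.slice(0, header.length)
  buffer.consume(header.length)
  return payload
}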

@marcus-pousette (Author) commented Apr 26, 2023

With PR #43 one of the most obvious things was addressed.

Still left:

  • The decoder does not seem to use the data length to pre-allocate (?). That would perhaps explain why large messages perform poorly.
  • Log messages seem to cost a bit, especially stringifyHeader in
    this.log?.trace('received frame %s', stringifyHeader(header))
    (a sketch of a cheap guard follows below)
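
A cheap guard is to skip the formatting entirely when trace output won't be emitted. This is a sketch assuming a debug-style logger with an enabled flag; the FrameHeader shape and stringifyHeader body are stand-ins for the real helpers:

// minimal logger shape, as exposed by the `debug` package
interface Logger {
  enabled: boolean
  trace: (fmt: string, ...args: unknown[]) => void
}

// stand-ins for the codec's header type and stringify helper
interface FrameHeader { type: number, flag: number, streamID: number, length: number }

function stringifyHeader (header: FrameHeader): string {
  return `type=${header.type} flag=${header.flag} streamID=${header.streamID} length=${header.length}`
}

function logFrame (log: Logger | undefined, header: FrameHeader): void {
  // only pay for string formatting when the output will actually be written
  if (log?.enabled === true) {
    log.trace('received frame %s', stringifyHeader(header))
  }
}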

@p-shahi commented Apr 26, 2023

@marcus-pousette thanks for creating this issue. AFAIK yamux should have no performance penalties.

@MarcoPolo can you confirm? Additionally, once libp2p/test-plans has a benchmarking setup for js-libp2p, we should be able to observe this discrepancy between mplex and yamux, right?

@marcus-pousette (Author) commented Apr 26, 2023

I see, then I guess it is in the implementation details.

There is a benchmark setup in this lib, and its output also shows that there is a discrepancy.

Running the benchmark from master, I get this:

✔ yamux send and receive 1 0.0625KB chunks                            15954.82 ops/s    62.67700 us/op   x0.943      11068 runs   1.37 s
✔ yamux send and receive 1 1KB chunks                                 15674.96 ops/s    63.79600 us/op   x0.991       6594 runs  0.820 s
✔ yamux send and receive 1 64KB chunks                                12712.78 ops/s    78.66100 us/op   x0.987       8240 runs   1.14 s
✔ yamux send and receive 1 1024KB chunks                              2628.176 ops/s    380.4920 us/op   x0.971       1667 runs   1.14 s
✔ yamux send and receive 1000 0.0625KB chunks                         126.6361 ops/s    7.896644 ms/op   x0.978         40 runs  0.831 s
✔ yamux send and receive 1000 1KB chunks                              119.9003 ops/s    8.340262 ms/op   x0.959         51 runs  0.940 s
✔ yamux send and receive 1000 64KB chunks                             37.21731 ops/s    26.86922 ms/op   x0.994         21 runs   1.08 s
✔ yamux send and receive 1000 1024KB chunks                           3.227550 ops/s    309.8325 ms/op   x1.002          6 runs   2.46 s
✔ mplex send and receive 1 0.0625KB chunks                            13847.92 ops/s    72.21300 us/op   x1.046       8967 runs   1.19 s
✔ mplex send and receive 1 1KB chunks                                 14239.74 ops/s    70.22600 us/op   x1.045       8129 runs   1.03 s
✔ mplex send and receive 1 64KB chunks                                12067.97 ops/s    82.86400 us/op   x0.987       7406 runs   1.04 s
✔ mplex send and receive 1 1024KB chunks                              3554.330 ops/s    281.3470 us/op   x1.014       2787 runs   1.24 s
✔ mplex send and receive 1000 0.0625KB chunks                         312.3817 ops/s    3.201212 ms/op   x1.004        255 runs   1.34 s
✔ mplex send and receive 1000 1KB chunks                              270.3823 ops/s    3.698467 ms/op   x0.954        274 runs   1.54 s
✔ mplex send and receive 1000 64KB chunks                             48.38998 ops/s    20.66543 ms/op   x0.856         42 runs   1.39 s
✔ mplex send and receive 1000 1024KB chunks                           3.216166 ops/s    310.9292 ms/op   x0.851          4 runs   1.82 s

There is almost a 2x difference between yamux and mplex for some of the runs here.

@marcus-pousette (Author) commented Apr 26, 2023

I tried running the VS Code profiler on the benchmark tasks, but sadly the results are very opaque (I fail to get any real insight).

@achingbrain (Collaborator) commented

Yamux has historically been slower than mplex, but it hasn't had the same amount of profiling applied, so there is almost certainly some low-hanging fruit to be had.

@marcus-pousette thank you for looking into this.

@marcus-pousette (Author) commented

Yeap! There should be some low-hanging fruit here.

I have not found anything critical yet, but...

  • It "feels" like readHeader and readBytes could be improved if the consume method returned the sliced bytes instead. Right now both this.buffer.slice(...) and this.buffer.consume(...) do an equivalent iteration over the underlying list to perform their purposes. Something like a this.buffer.splice would be interesting to see (a sketch follows after this list).

  • For the Uint8ArrayList-related things, we are doing a lot of append, sublist, and consume where the Uint8ArrayList (this.buffer) usually contains only one Uint8Array element. If you add a special-case implementation to Uint8ArrayList for when N = 1, you can get roughly a 20% performance gain on append. I have not checked the other operations yet, but I assume consume would be faster as well, since we do shift(): https://github.com/achingbrain/uint8arraylist/blob/0adda9ad78c3db75bee73a63acce5aee8c5f0f76/src/index.ts#L167

  • The Uint8ArrayList could (perhaps) perform better on consume if the underlying list were a linked list, so that we don't need to do shift() at all.

  • In yamux we perhaps don't need to construct and pass arguments as objects for private functions that are invoked often.

  • Logging does not seem to have too much effect on performance overall.
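
For the splice idea in the first bullet, here is a rough sketch of a single-pass "return the bytes and consume them" operation. It works on a plain array of chunks as a stand-in for Uint8ArrayList's internal list (splice itself is hypothetical, not an existing method), and it includes a fast path for the N = 1 case from the second bullet:

// take the first n bytes out of the chunk list in one pass, combining what
// slice() and consume() each do separately today; assumes the list holds at
// least n bytes
function splice (chunks: Uint8Array[], n: number): Uint8Array {
  // fast path: the list usually holds a single chunk
  if (chunks.length === 1 && chunks[0].byteLength >= n) {
    const head = chunks[0]
    const out = head.subarray(0, n)
    if (head.byteLength === n) {
      chunks.length = 0
    } else {
      chunks[0] = head.subarray(n)
    }
    return out
  }

  // general case: copy n bytes out while dropping fully-consumed chunks
  const out = new Uint8Array(n)
  let copied = 0
  while (copied < n) {
    const head = chunks[0]
    const take = Math.min(head.byteLength, n - copied)
    out.set(head.subarray(0, take), copied)
    if (take === head.byteLength) {
      chunks.shift()
    } else {
      chunks[0] = head.subarray(take)
    }
    copied += take
  }
  return out
}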

@marcus-pousette (Author) commented

Posting benchmark results from running it today:

codec benchmark
✔ frame header - encodeFrameHeader 8130081 ops/s 123.0000 ns/op - 1131753 runs 0.404 s
✔ frame header - encodeFrameHeaderNaive 2118644 ops/s 472.0000 ns/op - 2241054 runs 1.62 s
✔ frame header decodeHeader 9345794 ops/s 107.0000 ns/op - 1306895 runs 0.505 s
✔ frame header decodeHeaderNaive 2325581 ops/s 430.0000 ns/op - 2834015 runs 2.02 s

comparison benchmark
✔ yamux send and receive 1 0.0625KB chunks 13620.27 ops/s 73.42000 us/op - 9929 runs 1.41 s
✔ yamux send and receive 1 1KB chunks 14097.21 ops/s 70.93600 us/op - 6689 runs 0.924 s
✔ yamux send and receive 1 64KB chunks 11133.63 ops/s 89.81800 us/op - 5943 runs 0.950 s
✔ yamux send and receive 1 1024KB chunks 2604.350 ops/s 383.9730 us/op - 1658 runs 1.14 s
✔ yamux send and receive 1000 0.0625KB chunks 123.3899 ops/s 8.104389 ms/op - 40 runs 0.835 s
✔ yamux send and receive 1000 1KB chunks 112.0145 ops/s 8.927416 ms/op - 25 runs 0.735 s
✔ yamux send and receive 1000 64KB chunks 34.73366 ops/s 28.79052 ms/op - 53 runs 2.04 s
✔ yamux send and receive 1000 1024KB chunks 2.772760 ops/s 360.6515 ms/op - 10 runs 4.24 s
✔ mplex send and receive 1 0.0625KB chunks 14712.59 ops/s 67.96900 us/op - 9807 runs 1.22 s
✔ mplex send and receive 1 1KB chunks 14861.71 ops/s 67.28700 us/op - 12154 runs 1.32 s
✔ mplex send and receive 1 64KB chunks 13063.70 ops/s 76.54800 us/op - 6321 runs 0.820 s
✔ mplex send and receive 1 1024KB chunks 3554.229 ops/s 281.3550 us/op - 4373 runs 1.74 s
✔ mplex send and receive 1000 0.0625KB chunks 230.5917 ops/s 4.336670 ms/op - 189 runs 1.34 s
✔ mplex send and receive 1000 1KB chunks 206.2928 ops/s 4.847479 ms/op - 338 runs 2.16 s
✔ mplex send and receive 1000 64KB chunks 47.07421 ops/s 21.24305 ms/op - 21 runs 0.969 s
✔ mplex send and receive 1000 1024KB chunks 3.041689 ops/s 328.7647 ms/op - 10 runs 3.95 s

I still see a big difference in performance in many of the cases, perhaps especially for
"send and receive 1000 0.0625KB chunks".

@marcus-pousette (Author) commented

The mplex implementation uses a "buffer pool" for the encoder, while this implementation does not. I wonder whether that could play a part here: there is a lot of allocation/deallocation of memory that only lives for a short amount of time.
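
For illustration, a minimal sketch of the buffer-pool technique: reuse one slab and hand out subarrays, so the hot encode path doesn't allocate a fresh Uint8Array per frame. This is not mplex's actual pool; POOL_SIZE and allocHeader are made up for the example, and only the fixed 12-byte yamux header size is assumed:

const HEADER_LENGTH = 12 // fixed yamux frame header size
const POOL_SIZE = 10 * 1024

let pool = new Uint8Array(POOL_SIZE)
let poolOffset = 0

function allocHeader (): Uint8Array {
  if (poolOffset + HEADER_LENGTH > pool.byteLength) {
    // slab exhausted: earlier subarrays may still be referenced by pending
    // writes, so allocate a fresh slab instead of rewinding the offset
    pool = new Uint8Array(POOL_SIZE)
    poolOffset = 0
  }
  const frame = pool.subarray(poolOffset, poolOffset + HEADER_LENGTH)
  poolOffset += HEADER_LENGTH
  return frame
}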

@achingbrain (Collaborator) commented

We've done some real-world benchmarking, which has led to some performance improvements, and the results are that js-libp2p has the fastest streaming performance of the libp2p implementations tested: https://observablehq.com/@libp2p-workspace/performance-dashboard

I'm going to close this, as yamux has closed the performance gap and has back pressure, so it is vastly preferable to mplex.

Please re-open if you're still seeing a serious degradation.
