Support ByteBuffer as a backing storage on JVM #239

Open
11 of 15 tasks
fzhinkin opened this issue Nov 24, 2023 · 8 comments
fzhinkin (Collaborator) commented Nov 24, 2023

java.nio.ByteBuffer is THE data container in Java NIO APIs. Those who need features provided only by the NIO APIs (like non-blocking sockets) are doomed to use ByteBuffer for data transfer. Those who need better performance, or IO interfaces unavailable in the Java standard library, will end up using libraries that might roll out their own data containers but usually still allow wrapping or directly using ByteBuffer (as Netty or Aeron do).

It's possible to wrap a heap-allocated byte array (the backing storage for kotlinx-io segments) into a HeapByteBuffer, but the use of heap buffers comes with a cost. The majority of NIO API calls eventually perform a native call. If such a call (for example, a native wrapper for POSIX write) needs data, then NIO will supply it in the form of a DirectByteBuffer or a memory address extracted from the DirectByteBuffer. If the user provided a DirectByteBuffer, that buffer will be used; but if it was a HeapByteBuffer, its content will be copied into an internal cached DirectByteBuffer instance and only then passed to the native API. If the buffer is small, the copying cost can be neglected, but as the buffer grows, it starts playing a more significant role in overall performance.
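For illustration, here is the difference in plain NIO terms (standard JDK APIs, not kotlinx-io code): both writes go through the same native call, but only the heap-backed one pays for the extra copy.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel

fun writeBothWays(channel: FileChannel, payload: ByteArray) {
    // Heap buffer: NIO first copies its content into a cached internal
    // DirectByteBuffer and only then invokes the native write.
    channel.write(ByteBuffer.wrap(payload))

    // Direct buffer: its native memory is handed to the native write as-is,
    // with no intermediate copy.
    val direct = ByteBuffer.allocateDirect(payload.size)
    direct.put(payload)
    direct.flip()
    channel.write(direct)
}
```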

Besides the performance issues with the NIO API, a buffer residing in native memory is a necessity when it comes to implementing a Java API for not-yet-supported native IO facilities such as io_uring, send with the MSG_ZEROCOPY flag, epoll in edge-triggered mode, etc. The only option for allocating such a buffer that works across the wide range of JVM versions supported by Kotlin is DirectByteBuffer.

Unfortunately, using direct byte buffers is not always an option:

  • some APIs don't directly support it on the JVM (like MessageDigest)
  • manipulations with the buffer itself work significantly slower on Android

So the only viable option might be to support both byte arrays and ByteBuffers as backing storage and to provide a way to choose which implementation to use when starting an app.
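A hypothetical way such a choice could be exposed (the property name and SegmentFactory below are purely illustrative; nothing like this exists in kotlinx-io today): read a switch once at startup and route segment allocation through it.

```kotlin
import java.nio.ByteBuffer

// Illustrative only: neither the property name nor these types are part of kotlinx-io.
sealed interface SegmentStorage {
    class OfArray(val data: ByteArray) : SegmentStorage
    class OfDirectBuffer(val data: ByteBuffer) : SegmentStorage
}

object SegmentFactory {
    // Read the choice once, when the app starts.
    private val useDirectBuffers: Boolean =
        System.getProperty("kotlinx.io.useDirectByteBuffers", "false").toBoolean()

    fun allocate(size: Int): SegmentStorage =
        if (useDirectBuffers) SegmentStorage.OfDirectBuffer(ByteBuffer.allocateDirect(size))
        else SegmentStorage.OfArray(ByteArray(size))
}
```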

Tasks:

  • investigate ByteBuffers advantages/need to support it in kotlinx-io
  • publish results of BB performance investigation
  • evaluate kotlinx-io performance with DirectByteBuffer
  • publish performance characteristics of kotlinx-io w/ BB as a backing storage on JVM
  • refactor the library to allow using different Segment implementations
  • implement DirectByteBuffer-backed segments
  • investigate JDK22 MemorySegments usage instead of BB
  • implement polymorphic segment
  • port some benchmarks to Android
  • evaluate baseline performance on Android
  • evaluate DirectByteBuffers performance on Android
  • investigate R8 features/capabilities/issues
  • finalize and publish a design
  • test-library support for multiple segment types
  • tune performance (rewrite UTF8-manipulation routines, for example)
fzhinkin (Collaborator, Author) commented:

#135 will be done (at least partially) in the context of this project.

revonateB0T commented:

“manipulations with the buffer itself work significantly slower on Android”

On Android, a DirectByteBuffer is actually a non-movable byte array allocated on the Dalvik heap; it is never copied during GC, so we don't have to pay for an extra copy on native IO. Is there any benchmark indicating that we should not use DirectByteBuffer on Android?

fzhinkin (Collaborator, Author) commented Dec 4, 2023

@VDostoyevskiy some time ago, I ran several kotlinx-io benchmarks on Android and saw a significant slowdown when DirectByteBuffer was used as the backing storage (compared to the baseline with ByteArray as the backing storage). At first glance, it looked like ART's JIT failed to inline ByteBuffer's methods.
The current plan is to run an extended set of benchmarks on the device to verify the previous observation; I'm hoping to get to it by the end of the week. I'll publish the results as soon as it's done.

fzhinkin (Collaborator, Author) commented:

Slowly processing the task list.

publish results of BB performance investigation

Benchmarking results published here: https://github.com/fzhinkin/kotlinx-io-supplementary-benchmarks#kotlinx-io-supplementary-benchmarks

tl;dr
On the JVM, writing or reading a direct byte buffer via a channel is usually faster than the corresponding operation involving byte arrays and java.io streams.
The unexpected twist: on Android, byte buffer-based operations are ridiculously fast compared to their array-based java.io counterparts.

fzhinkin added a commit that referenced this issue Jan 8, 2024
fzhinkin (Collaborator, Author) commented Jan 8, 2024

As mentioned, the next step toward deciding whether and how byte buffers should be supported is to plug them into the library and run our benchmarks to see how that affects performance.

All the tables listed below are available as a Google Docs spreadsheet here: https://docs.google.com/spreadsheets/d/19krIuAKL7zVv8zFMKtUeGtCAZcuqRkPp7QWvZPRa784/edit?usp=sharing

Raw benchmarking results: https://github.com/Kotlin/kotlinx-io/tree/design/dbb/docs/design/byte-buffers/benchmarking

The hypothesis to check is that, at least on the JVM (or maybe even on Android), we can replace the byte arrays storing the data in kotlinx.io.Segment with direct java.nio.ByteBuffer buffers without losing performance.

I used code from several git branches for the analysis:

  • develop branch as the baseline
  • private/segments-public-api branch as an intermediate step before swapping byte arrays for byte buffers; this branch refactors segments and, after cleanup and review, will be integrated to partially cover Provide an API for performant bulk or sequential read/writes #135; it replaces explicit access to segment data with API calls, which simplified further integration of byte buffers; despite being an intermediate branch, I added it to the results to show how these particular changes affect overall performance;
  • private/dbb-benchmarking branch, built upon segments-public-api, where the segment's byte arrays were replaced with byte buffers;
  • private/dbb-benchmarking-unsafe - the same as above, but with some operations using Unsafe for reading from and writing into a ByteBuffer (more on that later).

The first two tables below represent results collected using a "core" subset of kotlinx-io benchmarks and their versions ported to androidx-benchmark.

The Improvement column contains the speedup relative to the baseline (develop branch-based, in all cases), computed as 100% * (baseline - alternative) / baseline. If code in the alternative branch performs better than the baseline, this value is positive; otherwise it is negative. N/A means the comparison is inconclusive (the CIs for the means overlapped and I can't say which result is actually better; yes, it's not the best way to check the results' significance).
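For example, plugging the first BufferReadNewByteArray (size=1) row from the JVM table below into that formula for the DirectBB column:

```kotlin
// Improvement = 100% * (baseline - alternative) / baseline
val baseline = 8.677      // ns, develop branch
val directBb = 18.803     // ns, DirectBB-backed branch
val improvement = 100.0 * (baseline - directBb) / baseline
println(improvement)      // ≈ -116.7 %: the DirectBB variant is slower, matching the table
```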

JVM results

Benchmark | Parameters | Baseline: avg. time, 0.999-CI | Seg. public API: avg. time, 0.999-CI, Improvement | DirectBB: avg. time, 0.999-CI, Improvement | DirectBB w/ Unsafe: avg. time, 0.999-CI, Improvement
kx.io.b.BufferReadNewByteArray.benchmark size=1 8.677 ns ±0.031 ns 8.823 ns ±0.197 ns N/A 18.803 ns ±0.354 ns -116.7 % 19.826 ns ±0.994 ns -128.5 %
kx.io.b.BufferReadNewByteArray.benchmark size=1024 70.959 ns ±1.791 ns 70.490 ns ±0.055 ns N/A 74.719 ns ±2.332 ns N/A 71.761 ns ±1.373 ns N/A
kx.io.b.BufferReadNewByteArray.benchmark size=24576 2.092 us ±0.014 us 2.119 us ±0.013 us -1.3 % 2.126 us ±0.008 us -1.6 % 2.111 us ±0.006 us -0.9 %
kx.io.b.BufferReadWriteByteArray.benchmark size=1 7.618 ns ±0.138 ns 7.446 ns ±0.016 ns 2.3 % 20.003 ns ±0.072 ns -162.6 % 18.876 ns ±3.276 ns -147.8 %
kx.io.b.BufferReadWriteByteArray.benchmark size=1024 32.496 ns ±2.299 ns 30.441 ns ±1.666 ns N/A 40.562 ns ±2.334 ns -24.8 % 35.648 ns ±2.632 ns N/A
kx.io.b.BufferReadWriteByteArray.benchmark size=24576 654.400 ns ±12.697 ns 670.191 ns ±20.315 ns N/A 692.064 ns ±33.039 ns N/A 703.603 ns ±53.146 ns N/A
kx.io.b.DecimalLongBenchmark.benchmark value='-9223372036854775806' 78.690 ns ±2.792 ns 48.928 ns ±0.194 ns 37.8 % 58.192 ns ±0.156 ns 26.0 % 50.903 ns ±0.113 ns 35.3 %
kx.io.b.DecimalLongBenchmark.benchmark value='9223372036854775806' 76.946 ns ±12.448 ns 48.157 ns ±0.094 ns 37.4 % 57.621 ns ±3.014 ns 25.1 % 49.928 ns ±0.234 ns 35.1 %
kx.io.b.DecimalLongBenchmark.benchmark value='1' 10.490 ns ±0.020 ns 9.207 ns ±0.226 ns 12.2 % 10.912 ns ±0.048 ns -4.0 % 9.706 ns ±0.155 ns 7.5 %
kx.io.b.HexadecimalLongBenchmark.benchmark value='9223372036854775806' 49.183 ns ±0.177 ns 33.212 ns ±0.140 ns 32.5 % 37.907 ns ±0.691 ns 22.9 % 32.799 ns ±0.179 ns 33.3 %
kx.io.b.HexadecimalLongBenchmark.benchmark value='1' 14.334 ns ±0.029 ns 11.669 ns ±0.153 ns 18.6 % 15.609 ns ±0.044 ns -8.9 % 11.420 ns ±0.055 ns 20.3 %
kx.io.b.IndexOfBenchmark.benchmark params='128:0:-1' 24.181 ns ±0.051 ns 24.824 ns ±0.111 ns -2.7 % 35.645 ns ±0.103 ns -47.4 % 31.842 ns ±0.074 ns -31.7 %
kx.io.b.IndexOfBenchmark.benchmark params='128:0:7' 5.989 ns ±0.020 ns 5.769 ns ±0.036 ns 3.7 % 18.053 ns ±15.223 ns N/A 5.716 ns ±0.033 ns 4.6 %
kx.io.b.IndexOfBenchmark.benchmark params='128:0:100' 19.668 ns ±0.113 ns 20.692 ns ±0.110 ns -5.2 % 26.891 ns ±0.035 ns -36.7 % 26.173 ns ±0.157 ns -33.1 %
kx.io.b.IndexOfBenchmark.benchmark params='128:8128:100' 27.298 ns ±0.407 ns 27.097 ns ±0.206 ns N/A 34.514 ns ±0.254 ns -26.4 % 31.714 ns ±0.060 ns -16.2 %
kx.io.b.IndexOfBenchmark.benchmark params='24576:0:-1' 3.600 us ±0.252 us 3.744 us ±0.010 us N/A 5.614 us ±0.118 us -56.0 % 5.599 us ±0.013 us -55.5 %
kx.io.b.IndexOfByteString.benchmark params='1024:2' 1.470 us ±0.003 us 1.289 us ±0.013 us 12.4 % 1.351 us ±0.004 us 8.1 % 1.047 us ±0.002 us 28.8 %
kx.io.b.IndexOfByteString.benchmark params='8192:2' 11.658 us ±0.033 us 10.370 us ±0.036 us 11.0 % 10.642 us ±0.113 us 8.7 % 8.253 us ±0.012 us 29.2 %
kx.io.b.IndexOfByteString.benchmark params='10000:2' 13.983 us ±0.021 us 12.418 us ±0.089 us 11.2 % 13.180 us ±0.141 us 5.7 % 10.140 us ±0.096 us 27.5 %
kx.io.b.IndexOfByteString.benchmark params='10000:8' 29.332 us ±0.709 us 26.705 us ±0.078 us 9.0 % 43.945 us ±0.104 us -49.8 % 25.455 us ±0.055 us 13.2 %
kx.io.b.ByteBenchmark.benchmark 3.230 ns ±0.142 ns 3.034 ns ±0.063 ns N/A 6.284 ns ±0.060 ns -94.6 % 3.287 ns ±0.061 ns N/A
kx.io.b.IntBenchmark.benchmark 3.862 ns ±0.007 ns 3.530 ns ±0.009 ns 8.6 % 4.124 ns ±0.020 ns -6.8 % 4.102 ns ±0.015 ns -6.2 %
kx.io.b.IntLeBenchmark.benchmark 4.072 ns ±0.025 ns 3.741 ns ±0.011 ns 8.1 % 4.206 ns ±0.025 ns -3.3 % 4.208 ns ±0.038 ns -3.3 %
kx.io.b.LongBenchmark.benchmark 6.068 ns ±0.051 ns 5.284 ns ±0.043 ns 12.9 % 4.101 ns ±0.014 ns 32.4 % 4.138 ns ±0.013 ns 31.8 %
kx.io.b.LongLeBenchmark.benchmark 6.843 ns ±0.096 ns 6.283 ns ±0.016 ns 8.2 % 4.977 ns ±0.010 ns 27.3 % 5.158 ns ±0.010 ns 24.6 %
kx.io.b.ShortBenchmark.benchmark 3.333 ns ±0.016 ns 3.334 ns ±0.047 ns N/A 4.109 ns ±0.004 ns -23.3 % 4.098 ns ±0.013 ns -22.9 %
kx.io.b.ShortLeBenchmark.benchmark 3.381 ns ±0.030 ns 3.404 ns ±0.020 ns N/A 4.134 ns ±0.042 ns -22.3 % 4.220 ns ±0.140 ns -24.8 %
kx.io.b.Utf8LineBenchmark.benchmark length=17, separator='LF' 43.060 ns ±0.096 ns 42.598 ns ±0.851 ns N/A 52.416 ns ±0.200 ns -21.7 % 45.684 ns ±0.678 ns -6.1 %
kx.io.b.Utf8LineBenchmark.benchmark length=17, separator='CRLF' 44.468 ns ±0.096 ns 43.776 ns ±0.137 ns 1.6 % 51.660 ns ±0.095 ns -16.2 % 44.960 ns ±1.098 ns N/A
kx.io.b.Utf8LineStrictBenchmark.benchmark length=17, separator='LF' 43.517 ns ±0.267 ns 44.060 ns ±0.149 ns -1.2 % 51.769 ns ±1.095 ns -19.0 % 45.266 ns ±0.696 ns -4.0 %
kx.io.b.Utf8LineStrictBenchmark.benchmark length=17, separator='CRLF' 44.092 ns ±0.165 ns 43.578 ns ±0.362 ns N/A 52.709 ns ±0.638 ns -19.5 % 45.587 ns ±0.364 ns -3.4 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='ascii', length=20 36.375 ns ±0.499 ns 34.739 ns ±0.176 ns 4.5 % 41.216 ns ±0.107 ns -13.3 % 35.824 ns ±1.050 ns N/A
kx.io.b.Utf8StringBenchmark.benchmark encoding='ascii', length=2000 1.576 us ±0.008 us 1.634 us ±0.008 us -3.7 % 1.755 us ±0.005 us -11.4 % 1.753 us ±0.006 us -11.3 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='ascii', length=200000 180.333 us ±0.497 us 181.497 us ±0.270 us -0.6 % 180.052 us ±0.588 us N/A 179.222 us ±0.553 us 0.6 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='utf8', length=20 79.311 ns ±3.425 ns 91.642 ns ±0.409 ns -15.5 % 106.419 ns ±4.981 ns -34.2 % 85.575 ns ±0.869 ns -7.9 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='utf8', length=2000 9.013 us ±0.077 us 9.389 us ±0.035 us -4.2 % 10.150 us ±0.033 us -12.6 % 8.534 us ±0.088 us 5.3 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='utf8', length=200000 913.628 us ±29.243 us 932.226 us ±22.271 us N/A 1051.384 us ±4.137 us -15.1 % 900.200 us ±1.462 us N/A
kx.io.b.Utf8StringBenchmark.benchmark encoding='sparse', length=20 54.438 ns ±0.394 ns 53.043 ns ±0.121 ns 2.6 % 63.510 ns ±0.107 ns -16.7 % 54.060 ns ±0.124 ns N/A
kx.io.b.Utf8StringBenchmark.benchmark encoding='sparse', length=2000 2.225 us ±0.011 us 2.137 us ±0.020 us 3.9 % 2.289 us ±0.007 us -2.9 % 2.329 us ±0.015 us -4.7 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='sparse', length=200000 234.449 us ±0.697 us 255.164 us ±1.010 us -8.8 % 229.776 us ±1.495 us 2.0 % 246.223 us ±7.980 us -5.0 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='2bytes', length=20 144.603 ns ±0.679 ns 99.734 ns ±0.383 ns 31.0 % 110.526 ns ±0.511 ns 23.6 % 98.628 ns ±0.297 ns 31.8 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='2bytes', length=2000 10.409 us ±0.435 us 7.970 us ±0.030 us 23.4 % 8.655 us ±0.272 us 16.8 % 7.650 us ±0.028 us 26.5 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='2bytes', length=200000 1077.305 us ±3.276 us 844.984 us ±1.783 us 21.6 % 865.833 us ±13.570 us 19.6 % 792.607 us ±1.371 us 26.4 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='3bytes', length=20 147.238 ns ±2.388 ns 114.884 ns ±0.582 ns 22.0 % 128.365 ns ±3.056 ns 12.8 % 115.223 ns ±0.667 ns 21.7 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='3bytes', length=2000 11.732 us ±0.019 us 9.599 us ±0.024 us 18.2 % 10.880 us ±0.025 us 7.3 % 9.498 us ±0.018 us 19.0 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='3bytes', length=200000 1204.980 us ±3.702 us 989.178 us ±3.210 us 17.9 % 1.109 ms ±0.001 ms 8.0 % 983.792 us ±1.027 us 18.4 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='4bytes', length=20 93.512 ns ±0.391 ns 82.132 ns ±0.910 ns 12.2 % 97.382 ns ±0.254 ns -4.1 % 81.331 ns ±0.402 ns 13.0 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='4bytes', length=2000 8.347 us ±0.063 us 7.161 us ±0.032 us 14.2 % 8.130 us ±0.033 us 2.6 % 7.100 us ±0.021 us 14.9 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='4bytes', length=200000 873.595 us ±3.063 us 750.184 us ±2.393 us 14.1 % 813.925 us ±4.698 us 6.8 % 719.284 us ±1.231 us 17.7 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='bad', length=20 95.384 ns ±0.355 ns 101.927 ns ±2.935 ns -6.9 % 110.901 ns ±1.039 ns -16.3 % 102.409 ns ±0.212 ns -7.4 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='bad', length=2000 7.536 us ±0.059 us 8.653 us ±0.310 us -14.8 % 9.930 us ±0.655 us -31.8 % 9.156 us ±0.074 us -21.5 %
kx.io.b.Utf8StringBenchmark.benchmark encoding='bad', length=200000 793.557 us ±6.575 us 877.934 us ±13.025 us -10.6 % 951.408 us ±8.023 us -19.9 % 667.837 us ±1.239 us 15.8 %

The segments-public-api branch performs better, or at least not worse, in almost all cases, except string encoding/decoding. There, the slowdown was mostly caused by switching from direct indexing into the Segment's array to indirect indexing (the caller passes a logical index into the span of the Segment's array holding data, and the accessor methods add pos/limit to it; i.e. `fun get(idx) = data[idx]` became `fun get(idx) = data[pos + idx]`). That is something that could be reverted to direct indexing, trading away some ease of API use.
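A minimal sketch of the two indexing schemes (not the actual kotlinx-io Segment API):

```kotlin
// Illustrative segment-like class; pos/limit delimit the readable bytes inside `data`.
class ArraySegment(
    private val data: ByteArray,
    private var pos: Int,      // offset of the first readable byte
    private var limit: Int     // offset past the last readable byte
) {
    // Direct indexing: the caller works with physical offsets into `data`.
    fun getAbsolute(idx: Int): Byte = data[idx]

    // Indirect indexing: the caller passes a logical index relative to `pos`,
    // and the accessor adds the offset (the extra work mentioned above).
    fun get(idx: Int): Byte = data[pos + idx]

    val size: Int get() = limit - pos
}
```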

Unfortunately, the dbb-benchmarking branch showed a significant performance drop in almost all scenarios where the Segment's data was accessed at shorter-than-int granularity. Various factors contribute to that result: type checks on every inlined ByteBuffer method call (to ensure the receiver is an instance of DirectByteBuffer), range checks requiring access to the byte buffer's state, and more code generated for every segment access (for UTF-8 string encoding this increases register pressure in JIT-compiled code and leads to more spills/fills being emitted).

To shrink the performance gap between the byte array and byte buffer based implementations, I tried using Unsafe to access the memory region assigned to a DirectByteBuffer (the private/dbb-benchmarking-unsafe branch). Unsafe allows bypassing type checks (the target is just a Long holding an address) and range checks (it's unsafe, right? :) and results in better performance for the string encoding/decoding cases (in some of them it now outperforms the develop branch). I concentrated on string ops performance and didn't touch the indexOf methods, so their performance remained poor in that branch.
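Roughly, the Unsafe-based access path looks like the sketch below (this is not the branch's actual code; note that on JDK 16+ the reflective read of Buffer.address additionally requires --add-opens java.base/java.nio=ALL-UNNAMED):

```kotlin
import java.nio.ByteBuffer

// Obtain the Unsafe instance via its static `theUnsafe` field.
private val unsafe: sun.misc.Unsafe = sun.misc.Unsafe::class.java
    .getDeclaredField("theUnsafe")
    .apply { isAccessible = true }
    .get(null) as sun.misc.Unsafe

// java.nio.Buffer keeps the native address of a direct buffer in its `address` field.
private val addressField = java.nio.Buffer::class.java
    .getDeclaredField("address")
    .apply { isAccessible = true }

fun directAddress(buffer: ByteBuffer): Long {
    require(buffer.isDirect) { "Only direct buffers have a native address" }
    return addressField.getLong(buffer)
}

// Reads through a raw address: no receiver type check, no bounds check.
fun readByte(baseAddress: Long, offset: Int): Byte =
    unsafe.getByte(baseAddress + offset)
```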

Android results

For Android, I ported most of the core benchmarks to androidx-benchmark (the android-benchmarks branch, forked from develop), forked the corresponding "JVM" branches, and merged the androidx branch into each of them (the private/segments-public-api-android and private/dbb-benchmarking-android branches).

Below are results gathered from a device:

Benchmark | Parameters | Baseline: avg. time, 0.999-CI | Seg. public API: avg. time, 0.999-CI, Improvement | DirectBB: avg. time, 0.999-CI, Improvement
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray size=1 102.622 ns ±0.060 ns 105.319 ns ±0.067 ns -2.6 % 262.792 ns ±0.620 ns -156.1 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray size=1024 330.932 ns ±0.471 ns 331.846 ns ±0.938 ns N/A 469.693 ns ±1.693 ns -41.9 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray size=24576 4.595 us ±0.036 us 4.852 us ±0.041 us -5.6 % 6.976 us ±0.057 us -51.8 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray size=1 188.783 ns ±7.791 ns 190.824 ns ±6.285 ns N/A 365.952 ns ±22.233 ns -93.8 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray size=1024 2.900 us ±0.058 us 2.953 us ±0.046 us N/A 3.351 us ±0.062 us -15.6 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray size=24576 35.271 us ±0.196 us 35.465 us ±0.310 us N/A 38.625 us ±0.538 us -9.5 %
kx.io.b.a.DecimalLongBenchmark.decLongRW value='-9223372036854775806' 750.906 ns ±0.505 ns 381.942 ns ±0.165 ns 49.1 % 1034.825 ns ±8.274 ns -37.8 %
kx.io.b.a.DecimalLongBenchmark.decLongRW value='9223372036854775806' 760.751 ns ±0.397 ns 406.559 ns ±0.214 ns 46.6 % 1073.903 ns ±0.721 ns -41.2 %
kx.io.b.a.DecimalLongBenchmark.decLongRW value='1' 124.628 ns ±0.064 ns 118.894 ns ±0.116 ns 4.6 % 185.450 ns ±0.111 ns -48.8 %
kx.io.b.a.HexadecimalLongBenchmark.hexLongRW value='9223372036854775806' 544.825 ns ±0.293 ns 279.604 ns ±0.126 ns 48.7 % 704.703 ns ±0.818 ns -29.3 %
kx.io.b.a.HexadecimalLongBenchmark.hexLongRW value='1' 163.418 ns ±0.076 ns 218.349 ns ±0.141 ns -33.6 % 219.523 ns ±0.157 ns -34.3 %
kx.io.b.a.IndexOfBenchmark.indexOf params='128:0:-1' 341.251 ns ±0.236 ns 334.579 ns ±0.111 ns 2.0 % 1899.041 ns ±10.569 ns -456.5 %
kx.io.b.a.IndexOfBenchmark.indexOf params='128:0:7' 57.001 ns ±0.036 ns 48.672 ns ±0.173 ns 14.6 % 150.282 ns ±0.083 ns -163.6 %
kx.io.b.a.IndexOfBenchmark.indexOf params='128:0:100' 274.937 ns ±0.123 ns 267.822 ns ±0.053 ns 2.6 % 1500.502 ns ±1.361 ns -445.8 %
kx.io.b.a.IndexOfBenchmark.indexOf params='128:8128:100' 299.373 ns ±0.116 ns 282.945 ns ±0.165 ns 5.5 % 1528.276 ns ±1.129 ns -410.5 %
kx.io.b.a.IndexOfBenchmark.indexOf params='24576:0:-1' 57.637 us ±0.023 us 57.624 us ±0.046 us N/A 355.938 us ±0.144 us -517.6 %
kx.io.b.a.IndexOfByteString.indexOf params='1024:2' 15.666 us ±0.103 us 7.041 us ±0.031 us 55.1 % 38.340 us ±0.195 us -144.7 %
kx.io.b.a.IndexOfByteString.indexOf params='8192:2' 124.209 us ±0.239 us 54.966 us ±0.024 us 55.7 % 305.591 us ±0.179 us -146.0 %
kx.io.b.a.IndexOfByteString.indexOf params='10000:2' 150.761 us ±0.120 us 66.998 us ±0.028 us 55.6 % 372.751 us ±0.290 us -147.2 %
kx.io.b.a.IndexOfByteString.indexOf params='10000:8' 250.041 us ±1.042 us 148.644 us ±0.065 us 40.6 % 852.738 us ±1.098 us -241.0 %
kx.io.b.a.IntegerValuesBenchmark.byteRW 36.208 ns ±0.015 ns 25.294 ns ±0.024 ns 30.1 % 53.608 ns ±0.021 ns -48.1 %
kx.io.b.a.IntegerValuesBenchmark.intRW 43.176 ns ±0.015 ns 33.737 ns ±0.015 ns 21.9 % 55.921 ns ±0.020 ns -29.5 %
kx.io.b.a.IntegerValuesBenchmark.intLeRW 42.785 ns ±0.031 ns 33.179 ns ±0.024 ns 22.5 % 54.095 ns ±0.027 ns -26.4 %
kx.io.b.a.IntegerValuesBenchmark.longLeRW 56.786 ns ±0.042 ns 49.917 ns ±0.486 ns 12.1 % 64.354 ns ±0.067 ns -13.3 %
kx.io.b.a.IntegerValuesBenchmark.longRW 46.210 ns ±0.024 ns 40.503 ns ±0.052 ns 12.3 % 56.023 ns ±0.036 ns -21.2 %
kx.io.b.a.IntegerValuesBenchmark.shortLeRW 41.546 ns ±0.023 ns 27.153 ns ±0.024 ns 34.6 % 55.189 ns ±0.025 ns -32.8 %
kx.io.b.a.IntegerValuesBenchmark.shortRW 42.347 ns ±0.030 ns 26.361 ns ±0.020 ns 37.8 % 57.663 ns ±0.064 ns -36.2 %
kx.io.b.a.Utf8LineBenchmarks.readLine length=17, separator='LF' 783.308 ns ±12.860 ns 787.348 ns ±14.717 ns N/A 1596.158 ns ±115.792 ns -103.8 %
kx.io.b.a.Utf8LineBenchmarks.readLine length=17, separator='CRLF' 829.941 ns ±12.269 ns 843.887 ns ±36.394 ns N/A 1673.793 ns ±79.104 ns -101.7 %
kx.io.b.a.Utf8LineBenchmarks.readLineStrict length=17, separator='LF' 780.764 ns ±13.163 ns 802.789 ns ±40.353 ns N/A 1609.004 ns ±125.967 ns -106.1 %
kx.io.b.a.Utf8LineBenchmarks.readLineStrict length=17, separator='CRLF' 893.010 ns ±18.179 ns 844.291 ns ±16.967 ns 5.5 % 1675.612 ns ±32.163 ns -87.6 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='ascii', length=20 648.041 ns ±13.030 ns 681.464 ns ±12.634 ns -5.2 % 1297.408 ns ±34.402 ns -100.2 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='ascii', length=2000 28.898 us ±0.432 us 29.776 us ±0.432 us -3.0 % 86.367 us ±0.979 us -198.9 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='ascii', length=200000 2.733 ms ±0.034 ms 2.777 ms ±0.033 ms N/A 6.401 ms ±0.080 ms -134.2 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='utf8', length=20 1.047 us ±0.014 us 1.092 us ±0.017 us -4.3 % 2.078 us ±0.109 us -98.5 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='utf8', length=2000 83.576 us ±2.004 us 86.883 us ±1.699 us N/A 189.267 us ±1.966 us -126.5 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='utf8', length=200000 8.096 ms ±0.162 ms 8.117 ms ±0.140 ms N/A 15.218 ms ±0.214 ms -88.0 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='sparse', length=20 752.265 ns ±17.496 ns 788.823 ns ±16.983 ns -4.9 % 1497.804 ns ±44.346 ns -99.1 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='sparse', length=2000 28.831 us ±0.568 us 29.218 us ±0.592 us N/A 88.670 us ±1.441 us -207.6 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='sparse', length=200000 2.692 ms ±0.045 ms 2.658 ms ±0.032 ms N/A 6.387 ms ±0.075 ms -137.3 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='2bytes', length=20 1.159 us ±0.022 us 1.234 us ±0.026 us -6.4 % 2.316 us ±0.072 us -99.9 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='2bytes', length=2000 79.157 us ±2.158 us 77.157 us ±1.695 us N/A 169.439 us ±1.949 us -114.1 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='2bytes', length=200000 7.409 ms ±0.149 ms 7.276 ms ±0.145 ms N/A 13.839 ms ±0.167 ms -86.8 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='3bytes', length=20 1.413 us ±0.026 us 1.449 us ±0.030 us N/A 3.224 us ±0.127 us -128.1 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='3bytes', length=2000 99.928 us ±1.888 us 96.950 us ±1.831 us N/A 227.922 us ±2.983 us -128.1 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='3bytes', length=200000 9.156 ms ±0.114 ms 9.142 ms ±0.174 ms N/A 19.725 ms ±0.269 ms -115.4 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='4bytes', length=20 1.034 us ±0.016 us 1.100 us ±0.061 us N/A 2.347 us ±0.157 us -127.0 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='4bytes', length=2000 64.705 us ±1.421 us 65.686 us ±1.500 us N/A 172.299 us ±2.715 us -166.3 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='4bytes', length=200000 6.099 ms ±0.086 ms 6.021 ms ±0.095 ms N/A 13.489 ms ±0.163 ms -121.2 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='bad', length=20 991.498 ns ±36.972 ns 1035.136 ns ±52.328 ns N/A 1536.603 ns ±136.426 ns -55.0 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='bad', length=2000 66.600 us ±1.704 us 68.397 us ±2.087 us N/A 111.606 us ±2.014 us -67.6 %
kx.io.b.a.Utf8Benchmark.readWriteString encoding='bad', length=200000 6.618 ms ±0.125 ms 6.864 ms ±0.169 ms N/A 9.545 ms ±0.104 ms -44.2 %

The results suggest that switching to direct byte buffers on Android would lead to a significant performance drop.

The collected results are not in byte buffers' favor (especially on Android); however, it might not be as bad as it seems in the context of a particular application. Also, these results reflect kotlinx.io.Buffer performance and, as was shown previously, direct byte buffers bring some performance improvement when it comes to I/O operations.

To check these two statements, I added kotlinx-io support to kotlinx.serialization (to its fork: https://github.com/fzhinkin/kotlinx.serialization) and added benchmarks to see how well kotlinx-io performs in JSON-serialization scenarios (and in scenarios where the serialized data is then written to a file).
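For context, the file-writing scenarios conceptually look like the sketch below. This is not the fork's actual code: the fork streams the encoder output directly into a Sink rather than going through an intermediate String, and the SystemFileSystem/writeString calls here assume the current kotlinx-io file API.

```kotlin
import kotlinx.io.buffered
import kotlinx.io.files.Path
import kotlinx.io.files.SystemFileSystem
import kotlinx.io.writeString
import kotlinx.serialization.Serializable
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

@Serializable
data class Tweet(val id: Long, val text: String)

// JSON-encode a value and push the bytes through a kotlinx-io Sink into a file.
fun writeJson(tweets: List<Tweet>, path: Path) {
    val sink = SystemFileSystem.sink(path).buffered()
    try {
        sink.writeString(Json.encodeToString(tweets))
    } finally {
        sink.close()
    }
}
```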

It would be fair to blame me for checking one of the worst-performing scenarios (string encoding), but JSON is an extremely popular serialization format, and it's crucial to show good results when using kotlinx-io in the context of JSON serialization.

Below are results collected for both JVM and Android (a subset of the serialization benchmarks was ported to androidx-benchmark: https://github.com/fzhinkin/kotlinx-serialization-android-benchmarks) by running benchmarks against the aforementioned branches (in fact, against 4 separate branches where UTF-8 code point writing was made public: private/dev-for-serialization, private/public-segments-api-for-serialization, private/dbb-benchmarking-for-serialization and private/dbb-benchmarking-unsafe-for-serialization).

JVM results

Benchmark | Baseline: avg. time, 0.999-CI | Seg. public API: avg. time, 0.999-CI, Improvement | DirectBB: avg. time, 0.999-CI, Improvement | DirectBB w/ Unsafe: avg. time, 0.999-CI, Improvement
k.b.j.CitmBenchmark.encodeCitmKotlinxIo 3.012 ms ±0.040 ms 2.711 ms ±0.064 ms 10.0 % 3.587 ms ±0.867 ms N/A 2.752 ms ±0.056 ms 8.7 %
k.b.j.CitmBenchmark.encodeCitmKotlinxIoFile 3.024 ms ±0.045 ms 2.816 ms ±0.105 ms 6.9 % 3.691 ms ±0.828 ms N/A 2.768 ms ±0.042 ms 8.5 %
k.b.j.CitmBenchmark.encodeCitmKotlinxIoFileChannel 2.806 ms ±0.039 ms 2.525 ms ±0.033 ms 10.0 % 4.012 ms ±0.101 ms -43.0 % 2.461 ms ±0.035 ms 12.3 %
k.b.j.CitmBenchmark.encodeCitmKotlinxIoileChannel 2.770 ms ±0.046 ms 2.532 ms ±0.054 ms 8.6 % 3.374 ms ±0.816 ms N/A 2.486 ms ±0.042 ms 10.3 %
k.b.j.JacksonComparisonBenchmark.kotlinSmallToKotlinxIo 191.203 ns ±2.059 ns 198.177 ns ±3.050 ns -3.6 % 195.991 ns ±22.415 ns N/A 172.998 ns ±17.318 ns N/A
k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIo 1.826 us ±0.038 us 1.640 us ±0.043 us 10.2 % 1.622 us ±0.066 us 11.2 % 1.765 us ±0.061 us N/A
k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIoFile 2.374 us ±0.035 us 2.139 us ±0.044 us 9.9 % 2.028 us ±0.034 us 14.6 % 2.093 us ±0.040 us 11.8 %
k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIoFileChannel 1.959 us ±0.059 us 1.927 us ±0.007 us N/A 1.856 us ±0.028 us 5.2 % 1.916 us ±0.019 us N/A
k.b.j.TwitterBenchmark.encodeTwitterKotlinxIo 147.486 us ±2.530 us 130.995 us ±2.066 us 11.2 % 147.387 us ±7.603 us N/A 137.207 us ±1.966 us 7.0 %
k.b.j.TwitterBenchmark.encodeTwitterKotlinxIoFile 143.767 us ±0.698 us 137.564 us ±2.993 us 4.3 % 144.946 us ±7.920 us N/A 135.599 us ±2.992 us 5.7 %
k.b.j.TwitterBenchmark.encodeTwitterKotlinxIoFileChannel 134.166 us ±2.064 us 124.670 us ±2.548 us 7.1 % 130.584 us ±5.718 us N/A 126.153 us ±1.929 us 6.0 %
k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIo 2.064 ms ±0.215 ms 1.916 ms ±0.209 ms N/A 2.823 ms ±0.570 ms N/A 2.023 ms ±0.351 ms N/A
k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIoFile 1.914 ms ±0.042 ms 1.894 ms ±0.125 ms N/A 2.593 ms ±0.891 ms N/A 1.743 ms ±0.036 ms 9.0 %
k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIoFileChannel 1.813 ms ±0.028 ms 1.632 ms ±0.031 ms 10.0 % 2.600 ms ±0.726 ms -43.5 % 1.683 ms ±0.049 ms 7.2 %

Android results

Benchmark | Baseline: avg. time, 0.999-CI | Seg. public API: avg. time, 0.999-CI, Improvement | DirectBB: avg. time, 0.999-CI, Improvement
o.e.Benchmarks.citm 26.505 ms ±1.471 ms 26.740 ms ±1.812 ms N/A 35.307 ms ±2.286 ms -33.2 %
o.e.Benchmarks.citmFile 27.038 ms ±1.647 ms 27.048 ms ±1.709 ms N/A 35.528 ms ±2.409 ms -31.4 %
o.e.Benchmarks.citmFileChannel 26.383 ms ±1.389 ms 25.932 ms ±1.253 ms N/A 33.768 ms ±2.132 ms -28.0 %
o.e.Benchmarks.twitterMacro 10.769 ms ±0.456 ms 11.089 ms ±0.460 ms N/A 18.691 ms ±0.581 ms -73.6 %
o.e.Benchmarks.twitterMacroFile 11.399 ms ±0.431 ms 11.652 ms ±0.463 ms N/A 19.281 ms ±0.679 ms -69.2 %
o.e.Benchmarks.twitterMacroFileChannel 11.918 ms ±0.582 ms 11.847 ms ±0.693 ms N/A 19.183 ms ±1.039 ms -60.9 %
o.e.Benchmarks.twitter 915.832 us ±40.304 us 926.380 us ±40.005 us N/A 1546.619 us ±51.164 us -68.9 %
o.e.Benchmarks.twitterFile 954.142 us ±41.837 us 971.910 us ±39.710 us N/A 1547.613 us ±56.647 us -62.2 %
o.e.Benchmarks.twitterFileChannel 854.305 us ±38.670 us 851.514 us ±37.706 us N/A 1423.378 us ±42.370 us -66.6 %

On the JVM, byte buffer-backed segments perform better only in conjunction with Unsafe access (and that's a separate topic to discuss); without it, there are some scenarios where they do better as well as scenarios where they do worse.
On Android, everything is much simpler: byte buffers are always worse, even when the only non-byte-buffer-based solution is to copy the data (as in the *FileChannel benchmarks).

[Instead of] Conclusion

I don't have a definite conclusion about the use of direct byte buffers on the JVM: to squeeze the maximum performance out of them, we have to use Unsafe (the sun.misc/jdk.internal one), whose future in the JDK is not that bright (and I was not able to beat ByteBuffers with MemorySegments created from it).

For Android, it seems there is no benefit in switching to ByteBuffer, even though buffer-based I/O (via NIO channels) is much faster than I/O operations involving heap-residing containers (but the use of off-heap data may still have some benefits).

fzhinkin (Collaborator, Author) commented Feb 1, 2024

I've also checked whether having multiple segment types affects performance when only one type is actually in use (the assumption being that at least the JVM will employ CHA, class hierarchy analysis, to avoid redundant type checks).

There's a branch (which won't compile for any target except the JVM), private/polymorphic-segments, where Segment was turned into an abstract class with two implementations: one with a ByteArray inside (based on the private/segments-public-api branch) and another with a ByteBuffer inside (based on the private/dbb-benchmarking branch). For benchmarking purposes, ByteBuffer-backed segments were never loaded during the experiments (verified with class-loading logs).
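A simplified sketch of that shape (these are not the actual kotlinx-io classes): as long as only one subclass is ever loaded, the JVM's CHA can devirtualize the accessor calls and avoid per-access type checks.

```kotlin
import java.nio.ByteBuffer

// One abstract segment, one subclass per backing storage.
abstract class Segment {
    abstract fun get(index: Int): Byte
    abstract fun set(index: Int, value: Byte)
}

class ByteArraySegment(private val data: ByteArray) : Segment() {
    override fun get(index: Int): Byte = data[index]
    override fun set(index: Int, value: Byte) { data[index] = value }
}

class ByteBufferSegment(private val data: ByteBuffer) : Segment() {
    override fun get(index: Int): Byte = data.get(index)
    override fun set(index: Int, value: Byte) { data.put(index, value) }
}
```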

I won't post a large table as above; I'll just briefly summarize the results:

  • on the JVM, there is no significant difference between the results gathered for the private/segments-public-api and private/polymorphic-segments branches. That's good: the presence of ByteBuffer-backed segments won't affect those who don't need them;
  • on Android, the situation is different: the use of polymorphic segments makes performance worse. That holds even with R8 applied with a config that allows treating the ByteBuffer-backed segment allocation path as unreachable.

JVM benchmarking results are here and Android benchmarking results are here.

fzhinkin (Collaborator, Author) commented May 7, 2024

With all that being said about the performance aspect of ByteBuffer support, it's also worth mentioning that ByteBuffers on the JVM and native-pointer-based segments on Native would help with supporting memory-mapped files.
With array-backed segments only, memory-mapped files would require an additional class/interface.
With polymorphic segments, we could (not without caveats) wrap a ByteBuffer or an mmapped pointer into a segment as a whole.
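A sketch of the memory-mapping part (standard NIO; wrapping the result into a ByteBuffer-backed segment is the hypothetical step discussed above):

```kotlin
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Map a file into memory; the resulting MappedByteBuffer is just a ByteBuffer, so a
// ByteBuffer-backed segment could wrap it as a whole and expose it through the regular
// Buffer/Source machinery without copying the mapped region into byte arrays.
fun mapReadOnly(path: Path): MappedByteBuffer =
    FileChannel.open(path, StandardOpenOption.READ).use { channel ->
        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
    }
```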

fzhinkin (Collaborator, Author) commented:

Probably we can support ByteBuffers on the JVM without hurting performance on Android by publishing a multi-release jar: the baseline implementation remains the same (byte-array backed), while polymorphic segments and BB support are enabled for, let's say, JDK 9 and onwards. Android tooling ignores the MRJ-specific classes while dexing, so the trick might work. 👿

I don't think it's a solution we should (or could) stick to, but it could solve the issue.
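A rough sketch of the packaging side of that trick (not the actual kotlinx-io build; compiling the JDK 9+ classes, toolchains, and source-set wiring are omitted):

```kotlin
// build.gradle.kts (illustrative)
plugins {
    kotlin("jvm") version "1.9.22" // version is illustrative
}

// Hypothetical source set holding the polymorphic/ByteBuffer-enabled segment classes.
val java9 by sourceSets.creating

tasks.jar {
    manifest {
        attributes(mapOf("Multi-Release" to "true"))
    }
    // Classes under META-INF/versions/9 are visible only to JDK 9+ class loaders;
    // Android's dexing reads the jar root and ignores this directory.
    into("META-INF/versions/9") {
        from(java9.output)
    }
}
```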
