API ergonomics surrounding unsigned types #177

@chris-allan

After working with TileDB in Java on a couple of biological imaging projects, I wanted to bring up the issue of developer/API ergonomics surrounding unsigned types on the JVM. Specifically, arrays of unsigned 8 and 16 bit data going in and out of NativeArray instances are of type short[] and int[] respectively. This has also been mentioned previously (#76) in the context of unsigned 64 bit arrays and the corresponding array type consistency (long vs BigInteger).

This typecasting "up" of unsigned arrays has two primary issues in my opinion:

  1. At least a doubling of the in-memory size of the resultant Java array following interactions with TileDB
  2. Conversion overhead when integrating with basic Java I/O and NIO APIs, and other image processing and computer vision APIs from Java such as N5 [1], OpenCV [2], Fiji [3], etc.

Where (1) is concerned, we regularly process multi-terabyte arrays and often hold several gigabytes of array data in memory while doing so. Doubling the memory requirements of every array we work with because of an API choice is at best unfortunate. Furthermore, due to (2), these widened arrays are short-lived: they are converted almost immediately, which adds to GC pressure. The basic Java I/O and NIO APIs primarily take byte[] or ByteBuffer. The image processing and computer vision APIs predominantly expect the array (byte[], short[], etc.) and its pixel type to be specified separately.
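To make the cost concrete, here is a minimal sketch of the copy that ends up being required when uint8 data arrives as a short[] but the downstream API wants a byte[] or ByteBuffer. It uses only plain Java; the class and method names (WidenedArrayCost, narrow) are hypothetical and nothing here is TileDB API.

```java
import java.nio.ByteBuffer;

class WidenedArrayCost {
    // Copy a widened short[] (one uint8 value per element) into a compact byte[].
    // This is an extra pass over the data plus a second allocation.
    static byte[] narrow(short[] widened) {
        byte[] compact = new byte[widened.length];
        for (int i = 0; i < widened.length; i++) {
            // The cast keeps the low 8 bits, so 255 is stored as (byte) -1.
            compact[i] = (byte) widened[i];
        }
        return compact;
    }

    public static void main(String[] args) {
        short[] widened = {0, 127, 128, 255};   // uint8 values held in 2 bytes each
        byte[] compact = narrow(widened);       // same values held in 1 byte each
        ByteBuffer buffer = ByteBuffer.wrap(compact); // what I/O and NIO APIs accept
        System.out.println(buffer.remaining() + " payload bytes vs "
                + (long) widened.length * Short.BYTES + " bytes in the short[]");
    }
}
```

At multi-gigabyte scale, both the widened array and the narrowed copy are live at the same time for the duration of the conversion, and the widened one becomes garbage right afterwards.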

I certainly understand the positions "Java doesn't have unsigned types" and "Java doesn't support array typecasts like C/C++". In practice, however, the standard library has gained a number of features over time that cater to unsigned workflows, or at least to typecasting "up" from byte[], starting with NIO's ByteBuffer in 1.4 and culminating with the toUnsigned*() methods added in 8.
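As an illustration of those standard library features, the sketch below reads unsigned values out of a compact byte[] on demand, without ever allocating a widened array. The class and method names (UnsignedViews, sumUint8, readUint16) are hypothetical, and the little-endian byte order is an assumption made for the example only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class UnsignedViews {
    // Sum uint8 samples stored in a signed byte[]; each element is widened
    // one at a time with Byte.toUnsignedInt (Java 8+).
    static long sumUint8(byte[] samples) {
        long sum = 0;
        for (byte s : samples) {
            sum += Byte.toUnsignedInt(s);
        }
        return sum;
    }

    // Read the uint16 sample at the given sample index from a buffer,
    // using an absolute get and Short.toUnsignedInt (Java 8+).
    static int readUint16(ByteBuffer buffer, int index) {
        return Short.toUnsignedInt(buffer.getShort(index * Short.BYTES));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xFF, (byte) 0xFF, 0x01, 0x00};
        // Assumption: the bytes are little-endian uint16 samples.
        ByteBuffer buffer = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(readUint16(buffer, 0));
        System.out.println(readUint16(buffer, 1));
        System.out.println(sumUint8(raw));
    }
}
```

The storage stays at one or two bytes per sample and the caller decides where, if anywhere, widening is needed.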

This dissonance has led us to flat out not specify unsigned types explicitly when using TileDB in Java, and instead to use signed types and handle the unsignedness implicitly. This in itself has two fairly negative consequences:

  1. Fill values are less than ideal
  2. Implicit typing has to be communicated to other TileDB language runtimes such as Python

(1) can probably be mitigated once a resolution to TileDB-Inc/TileDB#518 is implemented; for biological imaging use cases a non-configurable fill value can be problematic for a number of other reasons anyway. (2), however, is a real problem: the implicit type communication definitely complicates cross-language use when we are astype()ing numpy arrays everywhere.

I don't know how wedded current users of TileDB in other domains are to the way NativeArray functions. At the risk of sounding drastic, I'd like to first understand why we shouldn't (can't?) rely on the caller to handle the unsignedness of the array and use byte[] for uint8, short[] for uint16, etc. when working with NativeArray.


  1. https://github.com/saalfeldlab/n5
  2. https://opencv.org/
  3. https://imagej.net/Fiji
