API ergonomics surrounding unsigned types #177

After working with TileDB in Java on a couple of biological imaging projects, I wanted to raise the issue of developer/API ergonomics surrounding unsigned types on the JVM.  Specifically, arrays of unsigned 8 and 16 bit data going in and out of `NativeArray` instances are typed as `short[]` and `int[]` respectively.  This has been mentioned previously (#76) in the context of unsigned 64 bit arrays and the corresponding array type consistency (`long` vs `BigInteger`).

This typecasting "up" of unsigned arrays has two primary issues in my opinion:

1. At least a doubling of the in-memory size of the resultant Java array following interactions with TileDB
2. Conversion overhead when integrating with basic Java I/O and NIO APIs, and other image processing and computer vision APIs from Java such as N5 [1], OpenCV [2], Fiji [3], etc.

Where (1) is concerned, we regularly process multi-terabyte arrays and often have several gigabytes of array data in memory during such processing.  Doubling the memory requirements of every array we work with because of an API choice is at best unfortunate.  Furthermore, due to (2), these arrays don't last in memory very long, which adds to GC pressure.  The basic Java I/O and NIO APIs primarily take `byte[]` or `ByteBuffer`.  The image processing and computer vision APIs predominantly expect the array (`byte[]`, `short[]`, etc.) and its type to be specified separately.
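To make both costs concrete, here is a minimal sketch of the narrowing copy we end up writing when uint8 data arrives as a `short[]` and has to be handed to a byte-oriented API.  The class and method names (`WideningCost`, `narrowToBytes`) are hypothetical, the input array is a stand-in for data read from TileDB, and only JDK calls are used.

```java
import java.nio.ByteBuffer;

final class WideningCost {
    // Hypothetical helper: narrows uint8 values held one-per-element in a short[]
    // into a byte[] so they can be passed to byte[]/ByteBuffer based APIs.
    static byte[] narrowToBytes(short[] widened) {
        byte[] out = new byte[widened.length];
        for (int i = 0; i < widened.length; i++) {
            out[i] = (byte) widened[i]; // keeps the low 8 bits; 255 is stored as (byte) -1
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for uint8 data received as short[] (an assumption for illustration).
        short[] widened = {0, 127, 128, 255};
        byte[] raw = narrowToBytes(widened);      // extra allocation plus a full copy
        ByteBuffer buffer = ByteBuffer.wrap(raw); // wrap() itself does not copy

        System.out.println("payload bytes as short[]: " + widened.length * Short.BYTES);
        System.out.println("payload bytes as byte[]:  " + raw.length * Byte.BYTES);
        System.out.println("last value, read unsigned: " + Byte.toUnsignedInt(buffer.get(3)));
    }
}
```

The `short[]` holds twice the payload of the `byte[]`, and it becomes garbage as soon as the narrowed copy exists.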

I certainly understand the positions "Java doesn't have unsigned types" and "Java doesn't support array typecasts like C/C++".  However, in practice the standard library has added a number of features over time to cater to unsigned workflows, or at least to typecasting "up" from `byte[]`, starting with NIO's `ByteBuffer` in 1.4 and culminating with the `toUnsigned*()` method additions in 8.
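As a rough illustration of those standard library facilities, the sketch below reads uint16 values out of raw bytes through a `ShortBuffer` view and widens a single element at a time with `Short.toUnsignedInt`.  The class name `UnsignedViews`, the helper `uint16At`, and the little-endian byte order are assumptions made for the example; nothing here is TileDB API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

final class UnsignedViews {
    // Hypothetical helper: interprets raw bytes as little-endian uint16 values
    // and returns the element at `index`, widened to int only at the point of use.
    static int uint16At(ByteBuffer raw, int index) {
        ShortBuffer view = raw.duplicate().order(ByteOrder.LITTLE_ENDIAN).asShortBuffer();
        return Short.toUnsignedInt(view.get(index));
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xFF, (byte) 0xFF, 0x00, (byte) 0x80};
        ByteBuffer raw = ByteBuffer.wrap(bytes);

        System.out.println(uint16At(raw, 0));             // 0xFFFF read as unsigned
        System.out.println(uint16At(raw, 1));             // 0x8000 read as unsigned
        System.out.println(Byte.toUnsignedInt(bytes[0])); // uint8 widening, Java 8+
        System.out.println(Long.toUnsignedString(-1L));   // uint64 formatting without BigInteger
    }
}
```

The backing storage stays a `byte[]` throughout; the view and the `toUnsigned*()` calls only change how individual elements are interpreted.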

This dissonance has led us to flat out not specify unsigned types explicitly when using TileDB in Java, and instead to handle them implicitly and use signed types.  This in itself has two fairly negative consequences:

1. Fill values are less than ideal
2. Implicit typing has to be communicated to other TileDB language runtimes such as Python

(1) can probably be mitigated once a resolution to TileDB-Inc/TileDB#518 is implemented; for biological imaging use cases a non-configurable fill value can be problematic for a number of other reasons.  (2), the implicit type communication, remains problematic and definitely complicates cross-language use when we are `astype()`ing numpy arrays everywhere.

I don't know how wedded current users of TileDB in other domains are to the way `NativeArray` functions.  At the risk of sounding drastic, I'd like to first understand why we shouldn't (can't?) rely on the caller to handle the unsignedness of the array and use `byte[]` for `uint8`, `short[]` for `uint16`, etc. when working with `NativeArray`.
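For what it's worth, caller-side handling could look something like the following sketch, where the data stays in same-width signed arrays and each element is widened only where its numeric value is needed.  `CallerSideUnsigned`, `maxUint8`, and `sumUint16` are hypothetical names for illustration; this is not a proposal for specific `NativeArray` signatures.

```java
final class CallerSideUnsigned {
    // Maximum of uint8 samples stored in a byte[]; each element is widened at the point of use.
    static int maxUint8(byte[] samples) {
        int max = 0;
        for (byte s : samples) {
            max = Math.max(max, Byte.toUnsignedInt(s)); // equivalent to s & 0xFF
        }
        return max;
    }

    // Sum of uint16 samples stored in a short[]; accumulates in a long.
    static long sumUint16(short[] samples) {
        long sum = 0;
        for (short s : samples) {
            sum += Short.toUnsignedInt(s); // equivalent to s & 0xFFFF
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] uint8 = {1, (byte) 200, 127};
        short[] uint16 = {(short) 65535, 1};
        System.out.println("max uint8:  " + maxUint8(uint8));
        System.out.println("sum uint16: " + sumUint16(uint16));
    }
}
```

No widened copy of the array is ever allocated, and the same `byte[]` or `short[]` can be passed directly to the I/O and imaging APIs mentioned above.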

---

1. https://github.com/saalfeldlab/n5
2. https://opencv.org/
3. https://imagej.net/Fiji
