
Add Tensor.copyDataInto for allocation-free output reads on Android#8

Merged
jgibson2 merged 2 commits into polycam from jgibson/android-tensor-copyDataInto-into-polycam
Apr 29, 2026

Conversation

@jgibson2
Collaborator

Summary

Adds Tensor.copyDataInto(<TypedBuffer>) overloads to the Android Java API for every dtype, mirroring the version up for review upstream in pytorch/executorch#19171. This eliminates the per-call float[] / int[] / etc. allocation that the getDataAs*Array() accessors perform, by letting the caller supply (and reuse) a destination buffer.

Motivation

Profiling on Android (Perfetto) showed output.toTensor().dataAsFloatArray driving substantial ART GC pressure in the depth-inference loop. Each call allocates a fresh Java float[] sized to the tensor's element count and bulk-copies from the underlying off-heap buffer into it. For a model with two [1, 1, 144, 192] fp32 outputs that's ~110 KB × 2 = 220 KB of Java-heap churn per inference; at 2 fps, that's 440 KB/s just from reading outputs. This was enough to add several young-generation GCs per second, and ~14% of the inference wall time was lost to run-queue contention as GC suspended app threads.

The native side already exposes the underlying off-heap FloatBuffer as a zero-copy view of the C++ tensor's data_ptr() (via the package-private getRawDataBuffer). The new copyDataInto API gives external callers a public way to drain that buffer into a destination they own and reuse across calls.
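As a rough illustration of the float32 case, the drain-into-caller-buffer shape can be sketched as below. This is a self-contained sketch, not the merged code: in the real Tensor.java the source comes from the package-private getRawDataBuffer(), so passing the source buffer as a parameter here is an assumption made purely so the example runs standalone.

```java
import java.nio.FloatBuffer;

public class CopyDataIntoSketch {
  // Sketch of the float32 path. In Tensor.java the source would be the
  // zero-copy view from getRawDataBuffer(); here it is a parameter so the
  // example is self-contained.
  static void copyDataInto(FloatBuffer rawDataBuffer, FloatBuffer dst) {
    // Duplicate so the shared view's position is never disturbed.
    FloatBuffer src = rawDataBuffer.duplicate();
    src.rewind();
    dst.put(src); // bulk put: writes at dst's current position and advances it
  }

  public static void main(String[] args) {
    FloatBuffer src = FloatBuffer.wrap(new float[] {1f, 2f, 3f});
    FloatBuffer dst = FloatBuffer.allocate(3);
    copyDataInto(src, dst);
    System.out.println(dst.position()); // 3: position advanced by the bulk put
  }
}
```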

API

public void copyDataInto(FloatBuffer dst)        // float32 + float16 (widening)
public void copyDataInto(IntBuffer dst)          // int32
public void copyDataInto(LongBuffer dst)         // int64
public void copyDataInto(DoubleBuffer dst)       // float64
public void copyDataInto(ShortBuffer dst)        // float16 raw bits
public void copyDataInto(ByteBuffer dst)         // int8
public void copyDataIntoUnsigned(ByteBuffer dst) // uint8
  • Float16 has both a FloatBuffer overload (per-element half→float widening) and a ShortBuffer overload (raw fp16 bits, no widening).
  • int8 / uint8 are split into separate methods to mirror the existing getDataAsByteArray vs getDataAsUnsignedByteArray distinction — calling the wrong one throws.
  • All overloads write at the destination's current position and advance it (standard Buffer.put semantics).
  • The fp16 → fp32 widening path includes an upfront dst.remaining() capacity check so an undersized destination throws BufferOverflowException before any partial widening is observed (matching the all-or-nothing semantics of the bulk-put paths).
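The fp16 widening behavior described above can be sketched as follows. The method names and the hand-rolled halfToFloat conversion are illustrative assumptions, not the merged code (a javac 17 target cannot use Float.float16ToFloat, which arrived in Java 20):

```java
import java.nio.BufferOverflowException;
import java.nio.FloatBuffer;
import java.nio.ShortBuffer;

public class Fp16WideningSketch {
  // Sketch of the fp16 -> fp32 widening path with the upfront capacity check.
  // copyHalfsWidened and halfToFloat are illustrative names only.
  static void copyHalfsWidened(ShortBuffer rawBits, FloatBuffer dst) {
    ShortBuffer src = rawBits.duplicate();
    src.rewind();
    // Check capacity before writing anything, so an undersized destination
    // fails all-or-nothing instead of after a partial widen.
    if (dst.remaining() < src.remaining()) {
      throw new BufferOverflowException();
    }
    while (src.hasRemaining()) {
      dst.put(halfToFloat(src.get())); // per-element half -> float widening
    }
  }

  // IEEE 754 binary16 -> binary32, covering zeros, subnormals, Inf and NaN.
  static float halfToFloat(short bits) {
    int h = bits & 0xFFFF;
    int sign = (h >>> 15) << 31;
    int exp = (h >>> 10) & 0x1F;
    int mant = h & 0x3FF;
    if (exp == 0x1F) { // Inf / NaN
      return Float.intBitsToFloat(sign | 0x7F800000 | (mant << 13));
    }
    if (exp == 0) {
      if (mant == 0) return Float.intBitsToFloat(sign); // signed zero
      do { mant <<= 1; exp--; } while ((mant & 0x400) == 0); // normalize subnormal
      mant &= 0x3FF;
      exp += 1;
    }
    return Float.intBitsToFloat(sign | ((exp - 15 + 127) << 23) | (mant << 13));
  }
}
```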

Caller-side usage

// One-time setup
FloatBuffer depthBuf = Tensor.allocateFloatBuffer(numelDepth);

// Per inference
EValue[] outputs = module.forward(...);
depthBuf.rewind();
outputs[0].toTensor().copyDataInto(depthBuf);   // no allocation
// ... read from depthBuf ...

Test plan

  • TensorTest.kt covers each variant's happy path, asymmetric int8/uint8 rejection, fp16 raw-bits vs widened paths, position-respecting writes, and BufferOverflowException on undersized destinations (including the fp16 widening path's upfront check).
  • Tensor.java compiles standalone with javac 17.
  • google-java-format (1.23.0, project standard) and ktfmt (Meta style, matches spotless { kotlin { ktfmt() } } config) both clean.
  • Verify against polycam EstimateDepth.kt (forthcoming change to swap dataAsFloatArray for copyDataInto + a 2-slot encoder pool).
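The "2-slot encoder pool" mentioned in the last item is a forthcoming change, not part of this PR, but the reuse pattern it implies might look roughly like this. The pool, the stand-in copy function, and all names here are assumptions for illustration:

```java
import java.nio.FloatBuffer;

public class TwoSlotPoolSketch {
  // Hypothetical 2-slot destination pool: alternate buffers so an async
  // consumer can still read last frame's slot while this frame is written.
  // standInCopy stands in for outputs[0].toTensor().copyDataInto(dst).
  static void standInCopy(float[] fakeNativeData, FloatBuffer dst) {
    dst.put(fakeNativeData);
  }

  public static void main(String[] args) {
    int numel = 4;
    FloatBuffer[] slots = {FloatBuffer.allocate(numel), FloatBuffer.allocate(numel)};
    int next = 0;
    for (int frame = 0; frame < 3; frame++) {
      FloatBuffer dst = slots[next];
      next = (next + 1) % slots.length;
      dst.clear(); // reset position/limit before reuse; no reallocation
      standInCopy(new float[] {frame, frame, frame, frame}, dst);
      dst.flip(); // ready for a consumer to read [0, numel)
    }
    System.out.println(slots[0].remaining()); // elements readable after final flip
  }
}
```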

🤖 Generated with Claude Code

jgibson2 and others added 2 commits April 29, 2026 11:25
…put reads

The existing Tensor.getDataAsFloatArray() allocates a fresh float[] on every
call and copies from the underlying off-heap buffer into it. In a
steady-state inference loop this is a per-frame allocation proportional to
the output tensor size — for a 144x192 single-channel depth output that's
~110 KB per call, multiplied by however many output tensors a model
returns. Profiling on Android (Perfetto) showed this driving substantial
ART GC pressure that ended up stealing CPU back from the inference thread
itself via run-queue contention.

The off-heap FloatBuffer that backs the output Tensor is already owned by
the native side and exposed internally via the package-private
getRawDataBuffer(). copyDataInto(FloatBuffer dst) lets callers reuse a
pre-allocated destination across calls, eliminating the per-call float[]
allocation while keeping ownership unambiguous (the destination is
caller-owned, so it's safe to hand off to async consumers without racing
against the next forward() overwriting native memory).

Implemented for float32 (zero-copy bulk put) and float16 (per-element
half->float widening, mirroring getDataAsFloatArray). Other dtypes inherit
the base-class IllegalStateException default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the previous commit (which added copyDataInto(FloatBuffer) for
float32 / float16) to cover every dtype that the Tensor class supports:

  copyDataInto(ByteBuffer)            int8
  copyDataIntoUnsigned(ByteBuffer)    uint8
  copyDataInto(IntBuffer)             int32
  copyDataInto(LongBuffer)            int64
  copyDataInto(DoubleBuffer)          float64
  copyDataInto(ShortBuffer)           float16 (raw fp16 bits, no widening)

Mirrors the asymmetry of getDataAsByteArray (int8) vs
getDataAsUnsignedByteArray (uint8) — calling copyDataInto(ByteBuffer) on a
uint8 tensor throws, and copyDataIntoUnsigned(ByteBuffer) on int8 throws,
just like the array accessors. The two methods are intentionally separate
even though the underlying bits are identical, so a misuse caused by a
dtype switch surfaces as an exception instead of silently producing values
with the wrong sign interpretation.

The float16 ShortBuffer overload writes the raw 16-bit half-precision bits
without widening; copyDataInto(FloatBuffer) on the same tensor still
performs the half->float widening, matching the dual getDataAsShortArray /
getDataAsFloatArray accessors that float16 already exposes.

Tests cover each new variant's happy path plus the asymmetric int8/uint8
rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@meredithbayne

Nice improvements!

@jgibson2 jgibson2 merged commit 0783572 into polycam Apr 29, 2026
@jgibson2 jgibson2 deleted the jgibson/android-tensor-copyDataInto-into-polycam branch April 29, 2026 17:44
